Abstract



We consider a hidden Markov model with multiple observation processes, 
one of which is chosen at each point in time by a policy — a deterministic 
function of the information state — and attempt to determine which policy 
minimises the limiting expected entropy of the information state. Focusing 
on a special case, we prove analytically that the information state always 
converges in distribution, and derive a formula for the limiting entropy which 
can be used for calculations with high precision. Using this formula, we find
computationally that the optimal policy is always a threshold policy, allowing 
it to be easily found. We also find that the greedy policy is almost optimal. 
1 Introduction



A hidden Markov model is an underlying Markov chain together with an 
imperfect observation on this chain. In the case of multiple observations, 
the classical model assumes that they can be observed simultaneously, and 
considers them as a single vector of observations. However, the case where 
not all the observations can be used at each point in time often arises in 
practical problems, and in this situation, one is faced with the challenge of 
choosing which observation to use. 

We consider the case where the choice is made as a deterministic function of 
the previous information state, which is a sufficient statistic for the sequence 
of past observations. This function is called the policy, and we rank
policies according to the information entropy of the information state that
arises under each policy.

Our main results are: 

• The information state converges in distribution for almost every
underlying Markov chain, as long as each observation process gives a
perfect information observation with positive probability;

• In a special case (see Section 2.3 for a precise definition), we can write
down the limiting entropy explicitly as a rational function of
subgeometric infinite series, which allows the calculation of limiting
entropy to very good precision;

• Computational results suggest that the optimal policy is a threshold
policy, hence finding the optimal threshold policy is sufficient for finding
the optimal policy in general;

• Finding a locally optimal threshold policy is also sufficient, while
finding a locally optimal general policy is sufficient with average
probability 0.98; and

• The greedy policy is optimal 96% of the time, and close to optimal the
remaining times, giving a very simple yet reasonably effective
suboptimal alternative.
1.1 Motivation



The theory of hidden Markov models was first introduced in a series of
papers from 1966 by Leonard Baum and others under the more descriptive
name of Probabilistic Functions of Markov Chains [1]. An application of
this theory was soon found in speech recognition, spurring development, and
the three main problems (probability calculation, state estimation and
parameter estimation) had essentially been solved by the time of Lawrence
Rabiner's influential 1989 tutorial paper [13].

The standard hidden Markov model consists of an underlying state which is 
described by a Markov chain, and an imperfect observation process which 
is a probabilistic function of this underlying state. In most practical
examples, this single observation is equivalent to having multiple
observations, since we can simply consider them as a single vector of
simultaneous observations. However, this requires that these multiple
observations can be made and processed simultaneously, which is often not
the case.

Sometimes, physical constraints may prevent the simultaneous use of all of 
the available observations. This is most evident with a sensor which can op- 
erate in multiple modes. For example, a radar antenna must choose a wave- 
form to transmit; each possible waveform results in a different distribution 
of observations, and only one waveform can be chosen for each pulse. An- 
other example might be in studying animal populations, where a researcher 
must select locations for a limited pool of detection devices such as traps and 
cameras. 

Even when simultaneous observations are physically possible, other
constraints may restrict their availability. For example, in an application where
processors are much more expensive than sensors, a sensor network might 
reasonably consist of a large number of sensors and insufficient processing 
power to analyse the data from every sensor, in which case the processor 
must choose a subset of sensors from which to receive data. Similarly, a 
system where multiple sensors share a limited communication channel must 
decide how to allocate bandwidth, in a situation where each bit of bandwidth 
can be considered a virtual sensor, not all of which can be simultaneously 
used. 

Another example is the problem of searching for a target which moves
according to a Markov chain, where observation processes represent possible
sites to be searched. Indeed, MacPhee and Jordan's [11] special case of this
problem exactly corresponds to the special case we consider in Section 2.3,
although with a very different cost function. Johnston and Krishnamurthy
[7] show that this search problem can be used to model file transfer over a
fading channel, giving yet another application for an extended hidden Markov
model with multiple observation processes.

Note that in the problem of choosing from multiple observation processes, it 
suffices to consider the case where only one observation is chosen, by consid- 
ering an observation to be an allowable subset of sensors. The three main 
hidden Markov model problems of probability calculation, state estimation 
and parameter estimation remain essentially the same, as the standard al- 
gorithms can easily be adapted by replacing the parameters of the single 
observation process by those of whichever observation process is chosen at 
each point in time. 
Thus, the main interesting problem in the hidden Markov model with
multiple observation processes is that of determining the optimal choice of
observation process, which cannot be adapted from the standard theory of hidden
Markov models since it is a problem that does not exist in that framework. 
It is this problem which will be the focus of our work. 

We will use information entropy of the information state as our measure of 
optimality. While Evans and Krishnamurthy [6] use a distance between the
information state and the underlying state, it is not necessary to consider 
this underlying state explicitly, since the information state is by definition 
an unbiased estimator of the distribution of the underlying state. We choose 
entropy over other measures such as variance since it is a measure of uncer- 
tainty which requires no additional structure on the underlying set. 

The choice of an infinite time horizon is made in order to simplify the problem,
as is our decision to neglect sensor usage costs. These variables can be 
considered in future work. 
1.2 Past Work



The theory of hidden Markov models is already well-developed [1, 13]. On
the other hand, very little research has been done into the extended model 
with multiple observation processes. The mainly algorithmic solutions in the 
theory of hidden Markov models with a single observation process cannot be 
extended to our problem, since the choice of observation process does not 
exist in the unextended model. 

Similarly, there is a significant amount of work in the sensor scheduling
literature, but mostly considering autoregressive Gaussian processes such as 
in [15]. The case of hidden Markov sensors was considered by Jamie Evans 
and Vikram Krishnamurthy in 2001 [6], using policies where an observation 
process is picked as a deterministic function of the previous observation, and 
with a finite time horizon. They transformed the problem of choosing an 
observation into a control problem in terms of the information state, thereby 
entering the framework of stochastic control. They were able to write down 
the optimal policy as an intractable dynamic programming problem, and
suggested the use of approximations to find the solution. 

Krishnamurthy ^ followed up this work by showing that this dynamic pro- 
gramming problem could be solved using the theory of Partially Observed 
Markov Decision Processes when the cost function is of the form 

\[ c(z) = \sum_i z_i \, \lVert \delta(i) - z \rVert, \]

where $z$ is the information state, $\delta(i) \in \mathcal{P}(S)$ is the Dirac
measure and $\lVert \cdot \rVert$ is a piecewise constant norm. It was then
shown that such piecewise linear cost functions could be used to approximate
quadratic cost functions, in the sense that a sufficiently fine piecewise
linear approximation must have the
same optimal policy. In particular, this includes the Euclidean norm on 
the information state space, which corresponds to the expected mean-square 
distance between the information state and the distribution of the underlying 
chain. However, no bounds were found on how fine an approximation is 
needed. 

The problem solved by Evans and Krishnamurthy is similar to ours, but differs
in several respects. We consider policies based on the information state, which
we expect to perform better than policies based on only the previous
observation, as the information state is a sufficient statistic for the sample
path of observations (see Proposition 2.8, also [E]). We also consider an infinite
time horizon, and specify information entropy of the information state as 
our cost function. Furthermore, while Evans and Krishnamurthy consider 
the primary tradeoff as that between the precision of the sensors and the 
cost of using them, we do not consider usage costs and only aim to minimise 
the uncertainty associated with the measurements. 

Further work by Krishnamurthy and Djonin [9] extended the set of
allowable cost functions to a Lipschitz approximation to the entropy function,
and proved that threshold policies are optimal under certain very restrictive 
assumptions. Their breakthrough uses lattice theory methods [IB] to show 
that the cost function must be monotonic in a certain way with respect to 
the information state, and thus the optimal choice of observation process 
must be characterised by a threshold. However, this work still does not 
solve our problem, as their cost function, a time-discounted infinite sum of 
expected costs, differs significantly from our limiting expected entropy, and 
furthermore their assumptions are difficult to verify in practice. 
Another similar problem was also considered by Mohammad Rezaeian [Ti],
who redefined the information state as the posterior distribution of the
underlying chain given the sample path of observations up to the previous, as
opposed to current, time instant, which allowed for a simplification in the 
recursive formula for the information state. Rezaeian also transformed the 
problem into a Markov Decision Process, but did not proceed further in his 
description. 



The model for the special case we consider in Section 2.3 is an instance of
the problem of searching for a moving target, which was partially solved by
MacPhee and Jordan [11] with a very different cost function, namely the expected
cumulative sum of prescribed costs until the first certain observation. They
proved that threshold policies are optimal for certain regions of parameter 
space by analysing the associated fractional linear transformations. Unfor- 
tunately, similar approaches have proved fruitless for our problem due to the 
highly non-algebraic nature of the entropy function. 

Our problem as it appears here was first studied in unpublished work by 
Bill Moran and Sofia Suvorova, who conjectured that the optimal policy is 
always a threshold policy. More extensive work was done in [18], where it
was shown that the information state converges in distribution in the same
special case that we consider in Section 2.3. It was also conjectured that
threshold policies are optimal in this special case, although the argument
provided was difficult to work into a full proof. However, [18] contains a
mistake in the recurrence formula for the information state distribution, a
corrected version of which appears as Lemma 2.13. The main ideas of the
convergence proof still work, and are presented in corrected and improved
form in Section 2.2.
2 Analytic Results



2.1 Definitions 



We begin by precisely defining the model we will use. In particular, we will 
make all our definitions within this section, in order to expedite referencing.

For the reader's convenience, Table 2.1 at the end of this section lists the
symbols we will use for our model.

For a sequence $X_0, X_1, \ldots$ and any non-negative integer
$t \in \mathbb{Z}^+$, we will use the notation $X_{(t)}$ to represent the
vector $(X_0, \ldots, X_t)$.

Definition 2.1. A Markov Chain [12] is a stochastic process
$(X_t)_{t \in \mathbb{Z}^+}$, such that for all times $t > s$, all states $x$
and all measurable sets $A$,

\[ \mathbb{P}(X_t \in A \mid X_s = x, \mathcal{F}_s) = \mathbb{P}(X_t \in A \mid X_s = x), \]

where $\mathcal{F}_s$ denotes the canonical filtration. We will consistently
use the symbol $X_t$ to refer to an underlying Markov chain, and
$\pi_t = \mathbb{P}(X_t)$ to denote its distribution.

In the case of a time-homogeneous, finite state and discrete time Markov
chain, this simplifies to a sequence of random variables
$(X_t)_{t \in \mathbb{Z}^+}$ taking values in a common finite state space
$S = \{1, \ldots, n\}$, such that for all times $t \in \mathbb{Z}^+$,
$X_{t+1}$ is conditionally independent of $X_{(t-1)}$ given $X_t$, and the
distribution of $X_{t+1}$ given $X_t$ does not depend on $t$.

In this case, there exists an $n \times n$ matrix $T$, called the Transition
Matrix, such that for all $i, j \in S$ and $t \in \mathbb{Z}^+$,

\[ T_{ij} = \mathbb{P}(X_{t+1} = j \mid X_t = i) = \mathbb{P}(X_{t+1} = j \mid X_t = i, X_{(t-1)}). \tag{2.1} \]

Since we mainly consider Markov chains which are time-homogeneous and 
finite state, we will henceforth refer to them as Markov chains without the 
additional qualifiers. 

Definition 2.2. An Observation Process on the Markov chain
$(X_t)_{t \in \mathbb{Z}^+}$ is a sequence of random variables
$(Y_t)_{t \in \mathbb{Z}^+}$ given by $Y_t = c(X_t, W_t)$, where $c$ is a
deterministic function and $(W_t)_{t \in \mathbb{Z}^+}$ is a sequence of
independent and identically distributed random variables which is also
independent of the Markov chain $(X_t)_{t \in \mathbb{Z}^+}$ [S].

As before, we will only consider observation processes which take values in a
finite set $V = \{1, \ldots, m\}$. Similarly to before, there exists an
$n \times m$ matrix $M$, which we call the Observation Matrix, such that for
all $i, j \in S$, $k \in V$ and $t \in \mathbb{Z}^+$,

\[ M_{jk} = \mathbb{P}(Y_t = k \mid X_t = j) = \mathbb{P}(Y_t = k \mid X_t = j, X_{(t-1)}, Y_{(t-1)}); \]
\[ T_{ij} = \mathbb{P}(X_{t+1} = j \mid X_t = i) = \mathbb{P}(X_{t+1} = j \mid X_t = i, X_{(t-1)}, Y_{(t)}). \tag{2.2} \]

Heuristically, these two conditions can be seen as requiring that
observations depend only on the current state, and do not affect future states.
A diagrammatic interpretation is provided in Figure 2.1.

Figure 2.1: An observation process $(Y_t)$ on a Markov chain $(X_t)$.
At each node $X_t$, everything after $X_t$ is conditionally independent
of everything before $X_t$, given $X_t$.



Traditionally, a hidden Markov model is defined as the pair of a Markov chain 
and an observation process on that Markov chain. Since we will consider 
hidden Markov models with multiple observation processes, this definition 
does not suffice. We adjust it as follows. 

Definition 2.3. A Hidden Markov Model is the triple of a Markov
chain $(X_t)_{t \in \mathbb{Z}^+}$, a finite collection of observation
processes $\{(Y_t^{(i)})_{t \in \mathbb{Z}^+}\}_{i \in O}$ on
$(X_t)_{t \in \mathbb{Z}^+}$, and an additional sequence of random variables
$(I_t)_{t \in \mathbb{Z}^+}$, called the Observation Index, mapping into the
index set $O$.

Note that this amends the standard definition of a hidden Markov model. For 
convenience, we will no longer explicitly specify our hidden Markov models 
to have multiple observation processes. 

It makes sense to think of $(X_t)_{t \in \mathbb{Z}^+}$ as the state of a
system under observation, $\{(Y_t^{(i)})_{t \in \mathbb{Z}^+}\}_{i \in O}$ as a
collection of potential observations that can be made on this system, and
$(I_t)_{t \in \mathbb{Z}^+}$ as a choice of observation for each point in time.

Since our model permits only one observation to be made at each point in
time, and we will wish to determine which one to use based on past
observations, it makes sense to define $(I_t)_{t \in \mathbb{Z}^+}$ as a
sequence of random variables on the same probability space as the hidden
Markov model.

We will discard the potential observations which are not used, leaving us with
a single sequence of random variables representing the observations which are
actually made.

Definition 2.4. The Actual Observation of a hidden Markov model is
the sequence of random variables $(Y_t^{(I_t)})_{t \in \mathbb{Z}^+}$.

We will write $Y_t$ to mean $Y_t^{(I_t)}$, noting that this is consistent with
our notation for a hidden Markov model with a single observation process
$Y_t$. On the other hand, for a hidden Markov model with multiple observation
processes, the actual observation $(Y_t)_{t \in \mathbb{Z}^+}$ is not itself an
observation process in general.

Since our goal is to analyse a situation in which only one observation can be
made at each point in time, we will consider our hidden Markov model as
consisting only of the underlying state $(X_t)_{t \in \mathbb{Z}^+}$ and the
actual observation $(Y_t)_{t \in \mathbb{Z}^+}$. Where convenient, we will use
the abbreviated terms state and observation at time $t$ to mean $X_t$ and
$Y_t$ respectively.

For any practical application of this model to a physical system, the under- 
lying state cannot be determined, otherwise there would be no need to take 
non-deterministic observations. Therefore, we need a way of estimating the 
underlying state from the observations. 

Definition 2.5. The Information State Realisation of a hidden Markov
model at time $t$ is the posterior distribution of $X_t$ given the actual
observations and observation indices up to time $t$.

To make this definition more precise, we introduce some additional notation. 

First, recall that $(X_t)_{t \in \mathbb{Z}^+}$ has state space
$S = \{1, \ldots, n\}$, and define the set of probability measures on $S$,

\[ \mathcal{P}(S) = \Big\{ (p_1, \ldots, p_n) \in \mathbb{R}^n : p_i \ge 0 \ \forall i \in S, \ \sum_i p_i = 1 \Big\}. \tag{2.3} \]

Second, for a random variable $X$ with state space $S$ and an event $E$,
define the posterior distribution of $X$ given $E$,

\[ \mathbb{P}(X \mid E) = \big( \mathbb{P}(X = 1 \mid E), \ldots, \mathbb{P}(X = n \mid E) \big) \in \mathcal{P}(S). \tag{2.4} \]

Although we make this definition in general, we purposely choose the letters
$X$ and $S$, coinciding with the letters used to represent the underlying
Markov chain and the state space, as this is the context in which we will use
this definition. Then, the information state realisation is a function

\[ z_t(y_{(t)} ; i_{(t)}) = \mathbb{P}(X_t \mid Y_{(t)} = y_{(t)}, I_{(t)} = i_{(t)}). \tag{2.5} \]

This extends very naturally to a random variable.

Definition 2.6. The Information State Random Variable is

\[ Z_t = z_t(Y_{(t)} ; I_{(t)}). \]

Its distribution is the Information State Distribution
$\mu_t = \mathbb{P}(Z_t)$, taking values in $\mathcal{P}(\mathcal{P}(S))$, the
space of Radon probability measures on $\mathcal{P}(S)$, which is a subset of
the real Banach space of signed Radon measures on $\mathcal{P}(S)$.
Thus, the information state realisation is exactly a realisation of the
information state random variable. It is useful because it represents the
maximal information we can deduce about the underlying state from the
observation index and the actual observation, as shown in Proposition 2.8.

For the purpose of succinctness, we will refer to any of $z_t$, $Z_t$ and
$\mu_t$ as simply the Information State when the context is clear.

Definition 2.7. A random variable $Z$ is a sufficient statistic for a
parameter $X$ given data $Y$ if for any values $y$ and $z$ of $Y$ and $Z$
respectively, the probability $\mathbb{P}(Y = y \mid Z = z, X = x)$ is
independent of $x$ [3]. As before, we make the definition in general, but
purposely choose the symbols $X$, $Y$ and $Z$ to coincide with symbols already
defined.

In our case, X, which is a random variable, is used in the context of a 
parameter. Our problem takes place in a Bayesian framework, where the 
information state represents our belief about the underlying state, and is 
updated at each observation. 

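Proposition 2.8. The information state random variable $Z_t$ is a sufficient
statistic for the underlying state $X_t$, given the actual observations
$Y_{(t)}$ and the observation indices $I_{(t)}$.

Proof. By Definition 2.7, we need to prove that for all $y \in V^{t+1}$ and
$i \in O^{t+1}$,

\[ \mathbb{P}(Y, I \mid Z, x) = \mathbb{P}(Y_{(t)} = y, I_{(t)} = i \mid Z_t = z_t(y ; i), X_t = x) \tag{2.6} \]

is independent of $x$.

First, note that the event $\{Z_t = z_t(y ; i)\}$ is the disjoint union of
events $\{Y_{(t)} = y', I_{(t)} = i'\}$ over all
$(y', i') \in V^{t+1} \times O^{t+1}$ such that $z_t(y' ; i') = z_t(y ; i)$.

Next, if $z_t(y ; i) = z_t(y' ; i')$, then by definition of $z_t$, for all
$x \in S$,

\[ \mathbb{P}(X_t = x \mid Y_{(t)} = y, I_{(t)} = i) = \mathbb{P}(X_t = x \mid Y_{(t)} = y', I_{(t)} = i'). \tag{2.7} \]

Then, by definition of conditional probability,

\[ \frac{\mathbb{P}(X_t = x, Y_{(t)} = y', I_{(t)} = i')}{\mathbb{P}(X_t = x, Y_{(t)} = y, I_{(t)} = i)} = \frac{\mathbb{P}(Y_{(t)} = y', I_{(t)} = i')}{\mathbb{P}(Y_{(t)} = y, I_{(t)} = i)}. \tag{2.8} \]

Hence,

\begin{align*}
\mathbb{P}(Y, I \mid Z, x)
&= \mathbb{P}(Y_{(t)} = y, I_{(t)} = i \mid Z_t = z_t(y ; i), X_t = x) \\
&= \frac{\mathbb{P}(Y_{(t)} = y, I_{(t)} = i, Z_t = z_t(y ; i), X_t = x)}{\mathbb{P}(Z_t = z_t(y ; i), X_t = x)} \\
&= \frac{\mathbb{P}(Y_{(t)} = y, I_{(t)} = i, X_t = x)}{\sum_{y', i'} \mathbb{P}(Y_{(t)} = y', I_{(t)} = i', X_t = x)} \\
&= \Bigg( \sum_{y', i'} \frac{\mathbb{P}(Y_{(t)} = y', I_{(t)} = i', X_t = x)}{\mathbb{P}(Y_{(t)} = y, I_{(t)} = i, X_t = x)} \Bigg)^{-1} \\
&= \Bigg( \sum_{y', i'} \frac{\mathbb{P}(Y_{(t)} = y', I_{(t)} = i')}{\mathbb{P}(Y_{(t)} = y, I_{(t)} = i)} \Bigg)^{-1}. \tag{2.9}
\end{align*}

Each sum above is taken over all $(y', i') \in V^{t+1} \times O^{t+1}$ such
that $z_t(y' ; i') = z_t(y ; i)$. This expression is clearly independent of
$x$, which completes the proof that $Z_t$ is a sufficient statistic for
$X_t$. □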



Since the information state represents all information that can be deduced 
from the past, it makes sense to use it to determine which observation process 
to use in future. 

Definition 2.9. A policy on a hidden Markov model is a deterministic
function $g : \mathcal{P}(S) \to O$, such that for all $t \in \mathbb{Z}^+$,
$I_{t+1} = g(Z_t)$. We will use the symbol $A_i = g^{-1}\{i\}$ to denote the
preimage of the observation method $i$ under the policy, that is, the subset
of $\mathcal{P}(S)$ on which observation method $i$ is prescribed by the
policy. We will always consider the policy $g$ as fixed.

Since $Z_t$ is a function of $Y_{(t)}$ and $I_{(t)}$, this means that
$I_{t+1}$ is a function of $Y_{(t)}$ and $I_{(t)}$. Then by induction, we see
that $I_{(t+1)}$ is a function of $Y_{(t)}$ and $I_0$. Therefore, if we
prescribe some fixed $I_0$, then $I_{(t)}$ is a function of $Y_{(t)}$.

For fixed $I_0$, we can write

\[ Z_t = \mathbb{P}(X_t \mid Y_{(t)}, I_{(t)}) = \mathbb{P}(X_t \mid Y_{(t)}). \tag{2.10} \]

Hence, the information state random variable is a deterministic function of
only $Y_{(t)}$. In particular, the information state $Z_t$ can be written with
only one argument, that is, $Z_t = z_t(Y_{(t)})$.

Since our aim is to determine the underlying state with the least possible 
uncertainty, we need to introduce a quantifier of uncertainty. There are 
many possible choices, especially if the state space has additional structure. 
For example, variance would be a good candidate in an application where 
the state space embeds naturally into a real vector space. 

However, in the general case, there is no particular reason to suppose our 
state space has any structure; our only assumption is that it is finite, in 
which case information entropy is the most sensible choice, being a natu- 
ral, axiomatically-defined quantifier of uncertainty for a distribution on a 
countable set without any additional structure [1]. 

Definition 2.10. The Information Entropy of a discrete probability
measure $(p_1, \ldots, p_n) \in \mathcal{P}(S)$ is given by

\[ H((p_1, \ldots, p_n)) = - \sum_j p_j \log p_j. \]

We will use the natural logarithm, and define $0 \log 0 = 0$ in accordance
with the fact that $p \log p \to 0$ as $p \to 0$.

Since $Z_t$ takes values in $\mathcal{P}(S)$, $H(Z_t)$ is well-defined, and by
definition measures the uncertainty in $X_t$ given $Y_{(t)}$, and therefore by
Proposition 2.8, measures the uncertainty in our best estimate of $X_t$. Thus,
the problem of minimising uncertainty becomes quantified as one of minimising
$H(Z_t)$.
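
As a small numerical illustration of Definition 2.10, the following Python sketch evaluates the entropy of an information state with the convention $0 \log 0 = 0$. The function name `entropy` and the sample distributions are illustrative assumptions and do not come from the model itself.

```python
import math

def entropy(p):
    """Information entropy of a discrete distribution, in nats.

    Zero entries are skipped, which implements the convention 0 log 0 = 0.
    """
    return sum(-q * math.log(q) for q in p if q > 0.0)

# Hypothetical information states on a two-state space.
certain = [1.0, 0.0]   # the underlying state is known exactly
uniform = [0.5, 0.5]   # maximal uncertainty on two states

print(entropy(certain), entropy(uniform))
```

A perfectly informative state has zero entropy, while the uniform distribution on two states attains the maximum $\log 2$.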

We are particularly interested in the limiting behaviour, and thus, the main 
questions we will ask are: 

• Under what conditions, and in particular what policies, does $H(Z_t)$
converge as $t \to \infty$?

• Among the policies under which $H(Z_t)$ converges, which policy gives
the minimal limiting value of $H(Z_t)$?

• Are there interesting cases where $H(Z_t)$ does not converge, and if so,
can we generalise the above results?
Symbol                          Value             Meaning

$\mathcal{P}(S)$                set               probability measures on $S$
$\mathcal{P}(\mathcal{P}(S))$   set               probability measures on $\mathcal{P}(S)$
$A_i$                           set               region of observation process $i$
$H$                             function          information entropy
$I_t$                           random variable   observation index
$O$                             finite set        set of observation processes
$S$                             finite set        state space of Markov chain
$V$                             finite set        observation space
$W_t$                           random variable   observation randomness
$X_t$                           random variable   Markov chain
$Y_t^{(i)}$                     random variable   observation process
$Y_t$                           random variable   actual observation $Y_t^{(I_t)}$
$Z_t$                           random variable   information state random variable
$g$                             function          policy
$m$                             integer           number of observation values
$n$                             integer           number of states
$t$                             integer           position in time
$z_t$                           distribution      information state realisation
$\pi_t$                         distribution      Markov chain distribution
$\mu_t$                         distribution      information state distribution

Table 2.1: List of symbols, ordered alphabetically.
2.2 Convergence



In this section, we will prove that under certain conditions, the information 
state converges in distribution. This fact is already known for classical hidden 
Markov models, and is quite robust: LeGland and Mevel [10] prove geometric 
ergodicity of the information state even when calculated from incorrectly 
specified parameters, while Cappe, Moulines and Ryden [2] prove Harris 
recurrence of the information state for certain uncountable state underlying 
chains. We will present a mostly elementary proof of convergence in the case 
of multiple observation processes. 

To determine the limiting behaviour of the information state, we begin by 
finding an explicit form for its one-step time evolution. 

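Definition 2.11. For each observation process $i$ and each observed state
$y$, the r-function is the function
$r_{i,y} : \mathcal{P}(S) \to \mathcal{P}(S)$ given by

\[ r_{i,y}(z) = \frac{\sum_x \sum_j z_j T_{jx} M^{(i)}_{xy} \, \delta(x)}{\sum_x \sum_j z_j T_{jx} M^{(i)}_{xy}}, \]

where $\delta : S \to \mathcal{P}(S)$ is the Dirac measure on $S$ and $z_j$ is
the $j$th component of $z \in \mathcal{P}(S) \subset \mathbb{R}^n$.

Lemma 2.12. In a hidden Markov model with multiple observation
processes and a fixed policy $g$, the information state satisfies the
recurrence relation

\[ z_{t+1}(y_{(t+1)}) = r_{g(z_t(y_{(t)})),\, y_{t+1}}(z_t(y_{(t)})). \tag{2.11} \]

Proof. Let $i_{t+1} = g(z_t(y_{(t)}))$ and
$k_t = \mathbb{P}(Y_{(t)} = y_{(t)})$. By the Markov property as in
Definition 2.2, and the simplification (2.10),

\begin{align*}
z_{t+1}(y_{(t+1)})_x
&= \mathbb{P}(X_{t+1} = x \mid Y_{(t+1)} = y_{(t+1)}) \\
&= \frac{1}{k_{t+1}} \sum_j \mathbb{P}(Y_{t+1}^{(i_{t+1})} = y_{t+1} \mid X_{t+1} = x)
   \times \mathbb{P}(X_{t+1} = x \mid X_t = j) \, \mathbb{P}(X_t = j, Y_{(t)} = y_{(t)}) \\
&= \frac{k_t}{k_{t+1}} \sum_j z_t(y_{(t)})_j \, T_{jx} \, M^{(i_{t+1})}_{x y_{t+1}} \\
&= r_{i_{t+1},\, y_{t+1}}(z_t(y_{(t)}))_x, \tag{2.12}
\end{align*}

since $k_t / k_{t+1}$ does not depend on $x$ and
$\sum_x z_{t+1}(y_{(t+1)})_x = 1$. □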



Note that for each information state $z$ and each observation process $i$,
there are at most $m$ possible information states at the next step, which are
given explicitly by $r_{i,y}(z)$ for each observation $y \in V$.
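
The one-step update can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the two-state transition matrix `T`, the observation matrix `M0` and the starting state `z` are hypothetical values chosen for the example, states and observations are indexed from 0, and the normalising constant is computed as the quantity $(z \cdot T \cdot M^{(i)})_y$.

```python
def alpha(z, T, Mi, y):
    """Probability (z . T . M^(i))_y of seeing observation y at the next step."""
    n = len(z)
    return sum(z[j] * T[j][x] * Mi[x][y] for j in range(n) for x in range(n))

def r(z, T, Mi, y):
    """r-function: posterior over states after one transition and observation y."""
    n = len(z)
    # Predict with T, then weight each state by the likelihood of y.
    unnorm = [sum(z[j] * T[j][x] for j in range(n)) * Mi[x][y] for x in range(n)]
    total = sum(unnorm)
    if total == 0.0:
        raise ValueError("observation y has zero probability from z")
    return [u / total for u in unnorm]

# Hypothetical parameters: T[j][x] is the transition probability j -> x,
# M0[x][y] is the probability of observing y in state x.
T = [[0.9, 0.1],
     [0.2, 0.8]]
M0 = [[1.0, 0.0],
      [0.3, 0.7]]

z = [0.5, 0.5]
next_states = [r(z, T, M0, y) for y in range(2)]   # at most m = 2 successors
weights = [alpha(z, T, M0, y) for y in range(2)]   # their probabilities
print(next_states, weights)
```

With these values, observing `y = 1` is only possible from state 1, so the corresponding successor is the point mass on that state.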

Lemma 2.13. The information distribution satisfies the recurrence relation

\[ \mu_{t+1} = \sum_{i \in O} \sum_{y \in V} \int_{A_i} (z \cdot T \cdot M^{(i)})_y \, \delta(r_{i,y}(z)) \, d\mu_t(z), \]

where the sum is taken over all observation processes $i$ and all observation
states $y$, $\delta : \mathcal{P}(S) \to \mathcal{P}(\mathcal{P}(S))$ is the
Dirac measure on $\mathcal{P}(S)$, and $\cdot$ is the matrix product
considering $z \in \mathcal{P}(S) \subset \mathbb{R}^n$ as a row vector.

Proof. Since $Z_t = \mathbb{P}(X_t \mid Y_{(t)})$ is a deterministic function
of $Y_{(t)}$, given that $Y_{(t+1)} = y_{(t+1)}$,

\[ Z_{t+1} = z_{t+1}(y_{(t+1)}) = r_{g(z_t(y_{(t)})),\, y_{t+1}}(z_t(y_{(t)})). \tag{2.13} \]

This depends only on $z_t(y_{(t)})$ and $y_{t+1}$, so given that $Z_t = z$ and
$Y_{t+1} = y$,

\[ Z_{t+1} = r_{g(z), y}(z). \tag{2.14} \]

Integration over $(Z_t, Y_{t+1}) \in \mathcal{P}(S) \times V$ gives

\[ \mu_{t+1} = \sum_y \int \delta(r_{g(z), y}(z)) \, \mathbb{P}(Y_{t+1} = y \mid Z_t = z) \, d\mu_t(z). \tag{2.15} \]

By Definition 2.5, $Z_t$ is the posterior distribution of $X_t$ given the
observations up to time $t$, so $\mathbb{P}(X_t = x \mid Z_t = z) = z_x$, the
$x$th coordinate of the vector $z \in \mathcal{P}(S) \subset \mathbb{R}^n$.
Since $Z_t$ is a function of $Y_{(t)}$, which is a function of $X_{(t)}$ and
the observation randomness $W_{(t)}$, by the Markov property as in
Definition 2.2,

\begin{align*}
\mathbb{P}(Y_{t+1} = y \mid Z_t = z)
&= \sum_x \mathbb{P}(Y_{t+1} = y \mid X_t = x) \, \mathbb{P}(X_t = x \mid Z_t = z) \\
&= \sum_x (T \cdot M^{(g(z))})_{x,y} \, z_x = (z \cdot T \cdot M^{(g(z))})_y. \tag{2.16}
\end{align*}

Substituting (2.16) into (2.15), and splitting the integral over the regions
$A_i$ on which $g(z) = i$, completes the proof. □



Note that Lemma 2.13 shows that the information distribution is given by a 
linear dynamical system on V{V{S)), and therefore the information state is 
a Markov chain with state space V{S). We will use tools in Markov chain 
theory to analyse the convergence of the information state, for which it will 
be convenient to give a name to this recurrence. 
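
For a finitely supported information distribution, one step of the recurrence in Lemma 2.13 can be sketched as follows. Everything concrete here is an assumption made for illustration: the matrices `T` and `M`, the threshold policy `g`, and the dictionary representation of a measure (support point to mass). Points are rounded to 12 decimals so that floating-point duplicates merge.

```python
def alpha(z, T, Mi, y):
    """(z . T . M^(i))_y: probability of observation y at the next step."""
    n = len(z)
    return sum(z[j] * T[j][x] * Mi[x][y] for j in range(n) for x in range(n))

def r(z, T, Mi, y):
    """r-function, returned as a rounded tuple so it can be a dict key."""
    n = len(z)
    w = [sum(z[j] * T[j][x] for j in range(n)) * Mi[x][y] for x in range(n)]
    s = sum(w)
    return tuple(round(v / s, 12) for v in w)

def step(mu, T, M, g):
    """Apply the recurrence once to a measure mu given as {point: mass}."""
    out = {}
    for z, mass in mu.items():
        Mi = M[g(z)]                      # observation process chosen by the policy
        for y in range(len(Mi[0])):
            a = alpha(z, T, Mi, y)
            if a > 0.0:                   # skip observations of probability zero
                w = r(z, T, Mi, y)
                out[w] = out.get(w, 0.0) + mass * a
    return out

# Hypothetical two-state model with two observation processes.
T = [[0.9, 0.1], [0.2, 0.8]]
M = [[[1.0, 0.0], [0.3, 0.7]],            # process 0
     [[0.6, 0.4], [0.0, 1.0]]]            # process 1

def g(z):
    """Hypothetical threshold policy on the first coordinate."""
    return 0 if z[0] >= 0.5 else 1

mu0 = {(0.5, 0.5): 1.0}
mu1 = step(mu0, T, M, g)
mu2 = step(mu1, T, M, g)
print(mu1)
print(mu2)
```

Each step preserves total mass, and each support point has at most $m$ successors, so the support grows by at most a factor of $m$ per step.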

Definition 2.14. The transition function of the information distribution
is the deterministic function
$F : \mathcal{P}(\mathcal{P}(S)) \to \mathcal{P}(\mathcal{P}(S))$ given by
$F(\mu_t) = \mu_{t+1}$, extended linearly to all of
$\mathcal{P}(\mathcal{P}(S))$ by the recurrence in Lemma 2.13. The
coefficients $\alpha_{i,y}(z) = (z \cdot T \cdot M^{(i)})_y$ are called the
α-functions.
We now give a criterion under which the information state is always positive
recurrent. 

Definition 2.15. A discrete state Markov Chain $X_t$ is called Ergodic if
it is irreducible, aperiodic and positive recurrent. Such a chain has a unique
invariant measure $\pi$, which is a limiting distribution in the sense that
$X_t$ converges to $\pi$ in total variation norm [12].

Definition 2.16. A discrete state Markov Chain $X_t$ is called Positive
if every transition probability is strictly positive, that is, for all
$i, j \in S$, $\mathbb{P}(X_{t+1} = i \mid X_t = j) > 0$. This is a stronger
condition than ergodicity.

Definition 2.17. We shall call a hidden Markov model Anchored if the
underlying Markov chain $X_t$ is ergodic, and for each observation process
$i$, there is a state $x_i$ and an observation $y_i$ such that
$M^{(i)}_{x_i y_i} > 0$ and $M^{(i)}_{x y_i} = 0$ for all $x \neq x_i$. The
pair $(x_i, y_i)$ is called an Anchor Pair.

Heuristically, the latter condition allows for perfect information
$\delta(x_i)$ whenever the observation $y_i$ is made using observation
process $i$. This anchors the information chain in the sense that this state
can be reached with positive probability from any other state, thus resulting
in a recurrent atom in the uncountable state chain $Z_t$. On the other hand,
since each information state can make a transition to only finitely many other
information states, starting the chain at $\delta(x_i)$ results in a discrete
state Markov chain, for which it is much easier to prove positive recurrence.

Lemma 2.18. In an anchored hidden Markov model, for any anchor pair
$(x_i, y_i)$, $r_{i,y_i}(z) = \delta(x_i)$ for all $z \in \mathcal{P}(S)$.

Proof. When $x \neq x_i$, $M^{(i)}_{x y_i} = 0$ by Definition 2.17, so every
term in the numerator of Definition 2.11 is zero except the coefficient of
$\delta(x_i)$. Since we know the coefficients have sum 1, it follows that
$r_{i,y_i}(z) = \delta(x_i)$. □
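
Lemma 2.18 can be checked numerically on a small example. The sketch below assumes a hypothetical two-state chain and a single observation process `M0` whose observation 1 is only possible in state 1, so that `(1, 1)` is an anchor pair in 0-indexed notation; none of these values come from the text.

```python
def r(z, T, Mi, y):
    """r-function: posterior over states after one transition and observation y."""
    n = len(z)
    w = [sum(z[j] * T[j][x] for j in range(n)) * Mi[x][y] for x in range(n)]
    s = sum(w)
    return [v / s for v in w]

# Hypothetical positive transition matrix and anchored observation matrix.
T = [[0.9, 0.1],
     [0.2, 0.8]]
M0 = [[1.0, 0.0],    # state 0 never produces observation 1
      [0.3, 0.7]]    # state 1 produces observation 1 with probability 0.7
anchor_state, anchor_obs = 1, 1

# Apply the anchor observation to a grid of prior information states.
grid = [[k / 10, 1 - k / 10] for k in range(11)]
images = [r(z, T, M0, anchor_obs) for z in grid]
print(images)
```

Every prior on the grid is mapped to the same point mass, regardless of how much weight it placed on the anchor state.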

Lemma 2.19. In a positive, anchored hidden Markov model, the α-functions
$\alpha_{i,y_i}$, for each $i \in O$, are uniformly bounded below by some
$\epsilon > 0$, that is, $\alpha_{i,y_i}(z) \ge \epsilon$ for all $i$ and $z$.

Proof. We can write
$\alpha_{i,y_i}(z) = \sum_x z_x T_{x,x_i} M^{(i)}_{x_i y_i}$ by Definitions
2.14 and 2.17, which is bounded below by
$\min_x T_{x,x_i} M^{(i)}_{x_i y_i}$ since $\sum_x z_x = 1$. Since each
$M^{(i)}_{x_i y_i} > 0$, if all the entries of $T$ are positive, then
$\alpha_{i,y_i}(z)$ is bounded below uniformly in $z$ for fixed $i$, which
then implies a uniform bound in $z$ and $i$ since there are only finitely many
$i$. □
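
The bound in this proof is explicit, so it can be computed directly. The following sketch uses hypothetical matrices `T` and `M` and hypothetical anchor pairs, all 0-indexed, and compares $\epsilon = \min_{i,x} T_{x,x_i} M^{(i)}_{x_i y_i}$ with the α-function evaluated on a grid of information states.

```python
def alpha(z, T, Mi, y):
    """(z . T . M^(i))_y: probability of observation y at the next step."""
    n = len(z)
    return sum(z[j] * T[j][x] * Mi[x][y] for j in range(n) for x in range(n))

# Hypothetical positive, anchored two-state model.
T = [[0.9, 0.1], [0.2, 0.8]]
M = [[[1.0, 0.0], [0.3, 0.7]],     # process 0: observation 1 only from state 1
     [[0.6, 0.4], [0.0, 1.0]]]     # process 1: observation 0 only from state 0
anchors = [(1, 1), (0, 0)]         # (x_i, y_i) for each process i

# Lower bound from the proof: min over i and x of T[x][x_i] * M[i][x_i][y_i].
eps = min(T[x][xi] * M[i][xi][yi]
          for i, (xi, yi) in enumerate(anchors)
          for x in range(len(T)))

# Smallest anchor probability actually seen on a grid of information states.
grid = [[k / 100, 1 - k / 100] for k in range(101)]
worst = min(alpha(z, T, M[i], yi)
            for z in grid
            for i, (xi, yi) in enumerate(anchors))
print(eps, worst)
```

On this example the bound is attained at a vertex of the simplex, which is expected because $\alpha_{i,y_i}$ is linear in $z$.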

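Definition 2.20. For each state $x \in S$, the Orbit $R_x$ of
$\delta(x) \in \mathcal{P}(S)$ under the r-functions is

\begin{align*}
R_x = \{\delta(x)\}
&\cup \{ r_{i,y}(\delta(x)) : \alpha_{i,y}(\delta(x)) > 0 \} \\
&\cup \{ r_{i',y'} \circ r_{i,y}(\delta(x)) : \alpha_{i',y'}(r_{i,y}(\delta(x))) \, \alpha_{i,y}(\delta(x)) > 0 \} \\
&\cup \cdots .
\end{align*}

By requiring the α-functions to be positive, we exclude points in the orbit
which are reached with zero probability. Let $R = \bigcup_x R_x$.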

Proposition 2.21. In a positive, anchored hidden Markov model, there
exists a constant $0 < \lambda < 1$ such that for all measures $Z \in \mathcal{P}(\mathcal{P}(S))$, the mass
of the measure $F^t(Z)$ outside $R$ is bounded by $\lambda^t$, that is, $F^t(Z)(R^c) \leq \lambda^t$.

Proof. We can rewrite Definition 2.14 as

\[ F(Z) = \int_{\mathcal{P}(S)} Q(z)\, dZ(z), \tag{2.17} \]

where

\[ Q(z) = \sum_i \sum_y 1_{A_i}(z)\, \alpha_{i,y}(z)\, \delta(r_{i,y}(z)). \tag{2.18} \]

In this notation, the integral is the Lebesgue integral of the function $Q$ with
respect to the measure $Z$. Since $Q$ takes values in $\mathcal{P}(\mathcal{P}(S))$ and $Z$ is
a probability measure, the integral also takes values in $\mathcal{P}(\mathcal{P}(S))$, thus $F$ maps the
information state space $\mathcal{P}(\mathcal{P}(S))$ to itself.

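Since $Q(z)$ is a measure supported on the set of points reachable from $z$ via
an r-function, and $R$ is a union of orbits of r-functions and therefore closed
under r-functions, it follows that all mass in $R$ is mapped back into $R$ under
the evolution function, that is,

\[ \left( \int_R Q(z)\, dZ(z) \right)(R) = Z(R) = 1 - Z(R^c). \tag{2.19} \]

On the other hand, by Lemma 2.18, $r_{i,y_i}(z) = \delta(x_i) \in R$ for all $z$, hence,
writing $g$ for the policy,

\[
\begin{aligned}
\left( \int_{R^c} Q(z)\, dZ(z) \right)(R)
&\geq Z(R^c) \inf_z \big(Q(z)\big)(R) \\
&\geq Z(R^c) \inf_z \Big( 1_{A_{g(z)}}(z)\, \alpha_{g(z),y_{g(z)}}(z)\, \delta\big(r_{g(z),y_{g(z)}}(z)\big) \Big)(R) \\
&= Z(R^c) \inf_z \alpha_{g(z),y_{g(z)}}(z) \geq Z(R^c) \inf_{i,z} \alpha_{i,y_i}(z).
\end{aligned}
\tag{2.20}
\]

Putting these together gives

\[
F(Z)(R) = \left( \int_R Q(z)\, dZ(z) + \int_{R^c} Q(z)\, dZ(z) \right)(R)
\geq 1 - \Big( 1 - \inf_{i,z} \alpha_{i,y_i}(z) \Big) Z(R^c). \tag{2.21}
\]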
Setting $\lambda = 1 - \inf_{i,z} \alpha_{i,y_i}(z)$ gives $F(Z)(R^c) \leq \lambda Z(R^c)$, hence $F^t(Z)(R^c) \leq \lambda^t$
by induction. By Lemma 2.19, $\lambda < 1$, while $\lambda > 0$ since we can always choose
a larger value. □

Up to this point, we have considered the evolution function as a deterministic
function $F : \mathcal{P}(\mathcal{P}(S)) \to \mathcal{P}(\mathcal{P}(S))$. However, we can also consider it as a
probabilistic function $F : \mathcal{P}(S) \to \mathcal{P}(S)$. By Definition 2.14, $F$ maps points
in $R$ to $R$, hence the restriction $F|_R : R \to R$ gives a probabilistic function,
and therefore a Markov chain, with countable state space $R$.

By Proposition 2.21, the limiting behaviour of the information chain takes
place almost entirely in $R$ in some sense, so we would expect that convergence
of the restricted information chain $F|_R$ is sufficient for convergence of the full
information chain $F$. This is proved below.

Proposition 2.22. In a positive, anchored hidden Markov model, under
any policy, the chain $F|_R$ has at least one state of the form $\delta(x_i)$ which is
positive recurrent, that is, whose expected return time is finite.

Proof. Construct a Markov chain $P$ on the set $\{A_i\} \cup \{R_x\}$, with transition
probabilities $P(A_i, R_{x_i}) > 0$ for all $i$, $P(R_x, A_i) > 0$ whenever $R_x \cap A_i$ is
nonempty, and all other transition probabilities zero. We note that this is
possible since we allow each state a positive probability transition to some
other state.

Since $P$ is a finite state Markov chain, it must have a recurrent state. Each
state $R_x$ can reach some state $A_i$, so some state $A_i$ is recurrent; call it $A_1$.

Consider a state $z_0 = r^{k_2}(\delta(x_2)) \in R_{x_2} \subseteq R$ of the chain $F$ which is reachable
from $\delta(x_1)$, where $r^{k_2}$ is a composition of $k_2$ r-functions with corresponding
α-functions nonzero. Since the $A_i$ partition $\mathcal{P}(S)$, one of them must contain
$z_0$; call it $A_3$. We will assume $A_3 \neq A_1$; the proof follows the same argument
and is simpler in the case when $z_0 \in A_1$.

By definition of the α-functions,

\[ \mathbb{P}(Z_1 = \delta(x_3) \mid Z_0 = z_0) = \alpha_{3,y_3}(z_0) > 0. \tag{2.22} \]

This means that $\delta(x_3)$ is reachable from $\delta(x_1)$ in the chain $F$, hence in the
chain $P$, $A_3$ is reachable from $A_1$, and by recurrence of $A_1$, $A_1$ must also be
reachable from $A_3$ via some sequence of positive probability transitions

\[ A_3 \to R_{x_3} \to A_4 \to \cdots \to R_{x_j} \to A_1. \tag{2.23} \]

By definition of $P(R_{x_3}, A_4) > 0$, $R_{x_3} \cap A_4$ is nonempty, and thus contains
some point $r^{k_3}(\delta(x_3))$, where $r^{k_3}$ is a composition of $k_3$ r-functions with
corresponding α nonzero.

By Definition 2.20, each transition $r^k(\delta(x_3))$ to $r^{k+1}(\delta(x_3))$ in the information
chain occurs with positive probability, so

\[ \mathbb{P}(Z_{k_3+1} = r^{k_3}(\delta(x_3)) \mid Z_1 = \delta(x_3)) = \beta_3 > 0. \tag{2.24} \]

Since $r^{k_3}(\delta(x_3)) \in A_4$, by anchoredness and positivity,

\[ \mathbb{P}(Z_{k_3+2} = \delta(x_4) \mid Z_{k_3+1} = r^{k_3}(\delta(x_3))) = \gamma_3 > 0. \tag{2.25} \]

The Markov property then gives

\[ \mathbb{P}(Z_{k_3+2} = \delta(x_4) \mid Z_0 = z_0) = \alpha_{3,y_3}(z_0)\, \beta_3 \gamma_3 > 0. \tag{2.26} \]

Continuing via the sequence (2.23), we obtain

\[ \mathbb{P}(Z_{k_3+\cdots+k_j+j-1} = \delta(x_1) \mid Z_0 = z_0) = \alpha_{3,y_3}(z_0)\, \beta_3 \gamma_3 \cdots \beta_j \gamma_j > 0. \tag{2.27} \]

Thus, for every state $z \in R$ reachable from $\delta(x_1)$, we have found constants
$s_z \in \mathbb{N}$ and $c_z > 0$ such that

\[ \mathbb{P}(Z_{s_z} = \delta(x_1) \mid Z_0 = z) = c_z. \]

By Lemma 2.19, $\alpha_{3,y_3}(z)$ is uniformly bounded below, while $\beta_3 \gamma_3 \cdots \beta_j \gamma_j$
depends only on the directed path (2.23) and not on $z$, and thus is also
uniformly bounded below since there are only finitely many $A_i$, and hence it
suffices to choose finitely many such paths.

Similarly, $s_z$ also depends only on the directed path (2.23), and thus is uni-
formly bounded above. In particular, it is possible to pick $s_z$ and $c_z$ such
that $s = \sup_z s_z < \infty$ and $c = \inf_z c_z > 0$.

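Let $\tau$ be the first entry time into the state $\delta(x_1)$. By the above bound, we
have $\mathbb{P}(\tau > s \mid Z_0 = z) \leq 1 - c$ for any initial state $z$ reachable from $\delta(x_1)$.
Letting $\tau'$ and $Z'$ be independent copies of $\tau$ and $Z$, the time-homogeneous
Markov property gives

\[
\begin{aligned}
\frac{\mathbb{P}(\tau > (k+1)s \mid Z_0 = z)}{\mathbb{P}(\tau > ks \mid Z_0 = z)}
&= \mathbb{P}(\tau > (k+1)s \mid \tau > ks, Z_0 = z) \\
&= \sum_{z'} \mathbb{P}(\tau > (k+1)s \mid \tau > ks, Z_{ks} = z', Z_0 = z)
   \times \mathbb{P}(Z_{ks} = z' \mid \tau > ks, Z_0 = z) \\
&\leq \sup_{z'} \mathbb{P}(\tau > (k+1)s \mid \tau > ks, Z_{ks} = z') \\
&= \sup_{z'} \mathbb{P}(\tau' > s \mid Z'_0 = z') \\
&\leq 1 - c.
\end{aligned}
\tag{2.28}
\]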

By induction, $\mathbb{P}(\tau > ks \mid Z_0 = z) \leq (1 - c)^k$ for all $k$. Dropping the condition
on the initial distribution $Z_0$ for convenience, we have

\[
\mathbb{E}[\tau] = \sum_{k \in \mathbb{Z}_+} \mathbb{P}(\tau > k)
= \sum_{k \in \mathbb{Z}_+} \sum_{0 \leq t < s} \mathbb{P}(\tau > ks + t)
\leq \sum_{k \in \mathbb{Z}_+} \sum_{0 \leq t < s} \mathbb{P}(\tau > ks)
\leq \sum_{k \in \mathbb{Z}_+} s (1 - c)^k = \frac{s}{c} < \infty. \tag{2.29}
\]

In particular, $\mathbb{E}[\tau \mid Z_0 = \delta(x_1)] < \infty$, so $\delta(x_1)$ is a positive recurrent state. □

Lemma 2.23. The transition function $F$, considered as an operator on the
real Banach space $\mathcal{M}(\mathcal{P}(S))$ of signed Radon measures on $\mathcal{P}(S)$ with the total
variation norm, is linear with operator norm $\|F\| = 1$.

Proof. Linearity follows immediately from the fact that $F$ is defined as a
finite sum of integrals.

For each $\mu \in \mathcal{M}(\mathcal{P}(S))$, let $\mu = \mu^+ - \mu^-$ be the Hahn-Jordan decomposition, so
that $\|\mu\| = \|\mu^+\| + \|\mu^-\| = \mu^+(\mathcal{P}(S)) + \mu^-(\mathcal{P}(S))$ by definition of the total
variation norm.

If $\mu^+(\mathcal{P}(S)) = 0$, then $F(\mu^+)(\mathcal{P}(S)) = 0$ by linearity of $F$. Otherwise, let
$c = \mu^+(\mathcal{P}(S))$, so that $\frac{1}{c}\mu^+ \in \mathcal{P}(\mathcal{P}(S))$. Since $F$ maps probability measures
to probability measures, $\|F(\mu^+)\| = c\,\|F(\frac{1}{c}\mu^+)\| = c = \|\mu^+\|$, and similarly
for $\mu^-$, hence by linearity of $F$ and the triangle inequality,

\[ \|F(\mu)\| \leq \|F(\mu^+)\| + \|F(\mu^-)\| = \|\mu^+\| + \|\mu^-\| = \|\mu\|. \tag{2.30} \]

This shows that $\|F\| \leq 1$. Picking $\mu$ to be any probability measure gives
$\|F(\mu)\| = 1 = \|\mu\|$, hence $\|F\| = 1$. □

Theorem 2.24. In a positive, anchored hidden Markov model, irreducibil-
ity and aperiodicity of the restricted information chain $F|_R$ is a sufficient
condition for convergence in distribution of the information state $Z_t$ to some
discrete invariant measure $\mu_\infty \in \mathcal{P}(R) \subset \mathcal{P}(\mathcal{P}(S))$.

Proof. In this case, the restricted chain $F|_R$ is irreducible, aperiodic and posi-
tive recurrent (that is, ergodic) and hence has a unique invariant probability
measure $\mu_\infty$, which can be considered as an element of $\mathcal{P}(\mathcal{P}(S))$ supported
on $R \subset \mathcal{P}(S)$.



We will show that the information distribution $\mu_t$ converges in total variation
norm to $\mu_\infty$. Fix $\mu_0 \in \mathcal{P}(\mathcal{P}(S))$ and $\epsilon > 0$, and pick $s$ such that $\lambda^s < \epsilon/3$, with
$\lambda$ as defined in Proposition 2.21.

Let $\mu_s|_R$ be the restriction of the probability measure $\mu_s$ to $R$, which is a
positive measure with total mass $m = \mu_s(R) \geq 1 - \lambda^s > 0$, hence we can
divide to obtain the probability measure $\mu' = \frac{1}{m}\mu_s|_R$ supported on $R$.

Since $F|_R$ is ergodic and $\mu'$ is a probability measure, $F^k(\mu') = (F|_R)^k(\mu')$
converges to $\mu_\infty$ in total variation norm, so pick $K$ such that for all $k \geq K$,
$\|F^k(\mu') - \mu_\infty\| < \epsilon/3$.

For any $t \geq K + s$, we have the triangle inequality bound

\[
\|\mu_t - \mu_\infty\| \leq \|\mu_t - F^{t-s}(m\mu')\| + \|F^{t-s}(m\mu') - m\mu_\infty\| + \|m\mu_\infty - \mu_\infty\|. \tag{2.31}
\]

By Lemma 2.23, $F$ is linear with operator norm 1, so by Proposition 2.21,

\[
\|\mu_t - F^{t-s}(m\mu')\| \leq \|F\|^{t-s}\, \|\mu_s - \mu_s|_R\|
= \|\mu_s|_{R^c}\| = \mu_s(R^c) \leq \lambda^s < \frac{\epsilon}{3}. \tag{2.32}
\]

Since $t - s \geq K$, again using linearity and the fact that $m \leq 1$,

\[ \|F^{t-s}(m\mu') - m\mu_\infty\| = m\, \|F^{t-s}(\mu') - \mu_\infty\| < \frac{\epsilon}{3}. \tag{2.33} \]

Finally, again using Proposition 2.21,

\[ \|m\mu_\infty - \mu_\infty\| = (1 - m)\|\mu_\infty\| = 1 - \mu_s(R) \leq \lambda^s < \frac{\epsilon}{3}. \tag{2.34} \]

We see that for all $t \geq K + s$, $\|\mu_t - \mu_\infty\| < \epsilon$, so the information state $\mu_t$
converges to $\mu_\infty$ in total variation norm and therefore in distribution. □

Conjecture 2.25. The conditions of Theorem 2.24 can be weakened to the
case when the underlying chain is only ergodic.

Idea of Proof. Given an ergodic finite transition matrix $T$, some power $T^k$
will be positive, and thus we should still have convergence by taking time
in $k$-step increments, since the information state will not fluctuate too much
within those $k$ steps. The difficulty lies in the fact that the information state
taken in $k$-step increments is not the same as the information state of the
$k$-step chain. □

It is our belief that the information state converges in all but a small number
of pathological examples; however, we are only able to prove it in the above
cases. If the information state does not converge, then it does not make
sense to consider a limiting expected entropy. However, it is possible that
a Cesàro sum of the expected entropy converges, and the limsup and liminf
will certainly exist. Alternatively, we could simply work with a finite time
horizon.

2.3 Special Case

Continuing further within the general case has proven to be quite difficult, so
we will restrict the remainder of our results to a special case, where there are
two states and two observation processes with two possible observations each,
and each observation process observes a different underlying state perfectly.
Formally, $|S| = |V| = |O| = 2$, and the transition and observation matrices
are

\[
T = \begin{pmatrix} a & 1-a \\ 1-b & b \end{pmatrix}, \qquad
M^{(0)} = \begin{pmatrix} 1 & 0 \\ p & 1-p \end{pmatrix}, \qquad
M^{(1)} = \begin{pmatrix} 1-q & q \\ 0 & 1 \end{pmatrix}.
\tag{2.35}
\]

In order to exclude trivial cases, we will require that the parameters $a$, $b$, $p$
and $q$ are contained in the open interval $(0, 1)$, and $a + b \neq 1$.



Note that this special case exactly corresponds to the problem of searching
for a moving target studied by MacPhee and Jordan [11], although our cost
function, limiting expected entropy, is very different from theirs, expected
cumulative sum of prescribed costs until the first zero-entropy observation.



We will begin by proving that given this restriction, the information state 
always converges in distribution, except for one case which is pathological in 
a sense that will be explained later. This proof is both a correction and an 
improvement of the proof given in [18]. 

In the special case, the space $\mathcal{P}(S)$ is a 1-simplex embedded in $\mathbb{R}^2$, which we
can identify with the interval $[0, 1]$, via equating each point $z \in [0, 1]$ with
the point $z\delta(0) + (1 - z)\delta(1) \in \mathcal{P}(S)$, so that $z$ represents the mass at 0 in
the distribution.

By substituting these parameters into Definition 2.14, the transition
function in the special case is

\[
\begin{aligned}
F(\mu) ={}& \int_{A_0} \alpha_0(z)\, \delta(r_0(z))\, d\mu(z) + \int_{A_1} \alpha_1(z)\, \delta(r_1(z))\, d\mu(z) \\
&+ \int_{A_0} (1 - \alpha_0(z))\, d\mu(z)\, \delta(0) + \int_{A_1} (1 - \alpha_1(z))\, d\mu(z)\, \delta(1),
\end{aligned}
\tag{2.36}
\]

where:

• $\alpha_0(z) = (1 - p)(a + b - 1)z + 1 - b + pb$;

• $\alpha_1(z) = (1 - q)(1 - a - b)z + b + q - bq$;

• $r_0(z) = \dfrac{(a + b - 1)z + 1 - b}{\alpha_0(z)}$; and

• $r_1(z) = \dfrac{q(a + b - 1)z + q - qb}{\alpha_1(z)}$.

Note that the second line of (2.36) consists of two point masses at 0 and 1,
which is a feature of the anchoredness condition. In the special case, it allows
us to represent the location of masses by only two r-functions.

We will continue to use the symbols $\alpha_0$, $\alpha_1$, $r_0$ and $r_1$ in their meaning
above for the remainder of our discourse. Note that we have simplified the
notation, in that $r_0$ represents the r-function $r_{0,0}$, while $r_{0,1}$ does not appear
since $r_{0,1}(z) = 0$ identically, and similarly for the other symbols.
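As a sanity check on these closed forms, the following minimal sketch compares them with a generic two-state filter step (predict with $T$, then condition on the observation). The matrix layout is an assumption: rows of $T$ index the current state, rows of $M^{(i)}$ index the hidden state, columns index the observation, and the function names `closed_form` and `filter_step` are purely illustrative.

```python
def closed_form(a, b, p, q):
    """Closed-form alpha- and r-functions of the special case."""
    s = a + b - 1
    alpha0 = lambda z: (1 - p) * s * z + 1 - b + p * b
    alpha1 = lambda z: (1 - q) * (-s) * z + b + q - b * q
    r0 = lambda z: (s * z + 1 - b) / alpha0(z)
    r1 = lambda z: q * (s * z + 1 - b) / alpha1(z)
    return alpha0, alpha1, r0, r1


def filter_step(z, T, M, y):
    """Generic filter step; z is the prior mass at state 0.

    Returns (probability of observing y, posterior mass at state 0).
    """
    prior = (z, 1 - z)
    # Predict: push the prior through one step of the underlying chain.
    predicted = [prior[0] * T[0][x] + prior[1] * T[1][x] for x in (0, 1)]
    # Update: weight each state by the likelihood of the observation.
    joint = [predicted[x] * M[x][y] for x in (0, 1)]
    prob = joint[0] + joint[1]
    return prob, joint[0] / prob


# Illustrative parameter values (not taken from the text).
a, b, p, q = 0.7, 0.6, 0.3, 0.4
T = [[a, 1 - a], [1 - b, b]]
M0 = [[1, 0], [p, 1 - p]]      # process 0: observation 1 reveals state 1
M1 = [[1 - q, q], [0, 1]]      # process 1: observation 0 reveals state 0
alpha0, alpha1, r0, r1 = closed_form(a, b, p, q)

max_gap = 0.0
for k in range(11):
    z = k / 10
    prob0, post0 = filter_step(z, T, M0, 0)   # non-anchor observation, process 0
    prob1, post1 = filter_step(z, T, M1, 1)   # non-anchor observation, process 1
    max_gap = max(max_gap,
                  abs(prob0 - alpha0(z)), abs(post0 - r0(z)),
                  abs(prob1 - alpha1(z)), abs(post1 - r1(z)))
```

Under these conventions the anchor observations send the posterior to 0 (process 0) and to 1 (process 1), which is where the two point masses in (2.36) come from.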

Since the special case satisfies the conditions of positivity and anchoredness,
Proposition 2.22 applies, so irreducibility and aperiodicity of the information
chain are sufficient for ergodicity. We now show that this occurs in all but
two exceptional cases:

• Case 1: Each orbit is contained entirely in the policy region which
maps to that orbit, that is, $R_0 \subseteq A_0$ and $R_1 \subseteq A_1$. Note that by
(2.36), the α-functions are always strictly positive, so we must have
$R_0 = \{0, r(0), r^2(0), \ldots\}$ and $R_1 = \{1, r(1), r^2(1), \ldots\}$.

• Case 2: The orbits alternate periodically between policy regions, that
is, $\{1, r(0), r^2(1), r^3(0), \ldots\} \subseteq A_0$ and $\{0, r(1), r^2(0), r^3(1), \ldots\} \subseteq A_1$,
where $r$ is the combined r-function $r(z) = r_0(z) 1_{A_0}(z) + r_1(z) 1_{A_1}(z)$.

Let Case 0 denote the general case when neither Case 1 nor Case 2 occurs.

Lemma 2.26. The chain $F|_R$ has only one irreducible recurrent class, ex-
cept in Case 1, where it splits into two irreducible recurrent classes, both of
which are positive recurrent.

Proof. By Proposition 2.22, without loss of generality, assume that the state
0 is positive recurrent.

If 1 is reachable from 0, that is, there is some $t$ such that $(F|_R)^t(0, 1) > 0$,
then 0 is also reachable from 1 since 0 is recurrent, hence 0 and 1 are in the
same irreducible recurrent class. By (2.36), either 0 or 1 is reachable from
every state, so there cannot be any other recurrent classes.

Otherwise, if 0 is reachable from 1 but 1 is not reachable from 0, then 1
is transient, and furthermore, all of $R_1$ is transient since any $r^k(1) \in R_1$ is
reachable only via the transient state 1, while all of $R_0$ is reachable from the
recurrent state 0 and hence forms the only irreducible recurrent class.

Finally, if 0 and 1 are both unreachable from each other, then it must be
the case that $R_0 \subseteq A_0$ and $R_1 \subseteq A_1$, in which case the chain splits into two
irreducible classes $R_0$ and $R_1$, both of which are positive recurrent by the
argument in Proposition 2.22. □



Lemma 2.27. The recurrent classes are aperiodic, except in Case 2, where
the chain consists of a single irreducible recurrent class with period 2.

Proof. If any recurrent class is periodic, then at least one of 0 or 1 is recurrent
and periodic; without loss of generality suppose it is 0. Since 0 is periodic, it
cannot reach itself in 1 step, so 0 must be contained in $A_1$, thus 1 is reachable
from 0 and hence is in the same irreducible recurrent class. Note that by the
same argument, $1 \in A_0$, thus 0 reaches itself in 2 steps and hence the period
must be 2.

This means 0 cannot reach itself in an odd number of steps, so $r^k(0) \in A_1$
when $k$ is even and $r^k(0) \in A_0$ when $k$ is odd, and similarly for the orbit of
1, which is the only possibility of periodicity. □

Thus, there is one exception in which the information chain is periodic, and 
another in which it is reducible. As will be evident, both exceptions can be 
remedied. We begin by giving a class of policies under which reducibility 

cannot occur. This class of policies is simple enough to be easily analysed, 
and as will be conjectured later, always includes the optimal policy in any 
hidden Markov model within the special case. 

Definition 2.28. A policy $g$ is called a Threshold Policy if its preimages
$A_0 = g^{-1}\{0\}$ and $A_1 = g^{-1}\{1\}$ are both intervals.

A threshold policy is indeed given by a threshold, since there must be some
boundary point between $A_0$ and $A_1$, such that one observation process is used
on one side and the other is used on the other side. Outside the special
case, it is unclear what the equivalent definition would be, since the concept
of an interval does not generalise easily to higher dimensions.

Lemma 2.29. The linear fractional transformations $r_0$ and $r_1$ satisfy the
inequality $r_0(z) > r_1(z)$ for all $z \in [0, 1]$.

Proof. We can write

\[ r_0(z) = \frac{(a+b-1)z + 1 - b}{p\big((1-a)z + b(1-z)\big) + (a+b-1)z + 1 - b}. \]

Note that the coefficient $(1 - a)z + b(1 - z)$ of $p$ is strictly positive and in the denominator,
while $r_1(z)$ is exactly the same with $q^{-1}$ instead of $p$. Since $p, q \in (0, 1)$,
$p < 1 < q^{-1}$, hence $r_0(z) > r_1(z)$. □

Lemma 2.30. The linear fractional transformations $r_0$ and $r_1$ are both
strictly increasing when $a + b > 1$ and both strictly decreasing when $a + b < 1$.

Proof. The derivative of $r_0$ is

\[ r_0'(z) = \frac{p(a+b-1)}{\big((1-p)(a+b-1)z + 1 - b + pb\big)^2}, \]

which is positive everywhere if $a + b > 1$ and negative everywhere if $a + b < 1$. The same holds
for $r_1$, since it is identical with $q^{-1}$ instead of $p$. □

Lemma 2.31. The linear fractional transformations $r_0$ and $r_1$ have unique
fixed points $\eta_0$ and $\eta_1$, which are global attractors of their respective dynamical
systems.

Proof. Split the interval $[0, 1]$ into open subintervals at its interior fixed
points, noting that by inspection, the two boundary points are not fixed.
Since linear fractional transformations have at most two fixed points, there
must be between one and three such subintervals, all non-degenerate.

By continuity, in any such subinterval $I$, either $r_0(z) > z$ for all $z \in I$, or
$r_0(z) < z$ for all $z \in I$. Since $r_0(0) > 0$, $r_0(z) > z$ for all points $z$ in the
leftmost subinterval, and since $r_0(1) < 1$, $r_0(z) < z$ for all points $z$ in the
rightmost subinterval. This means that there are at least two subintervals.

If there were three subintervals, $I_1$, $I_2$ and $I_3$ in that order, then either
$r_0(z) > z$ for all $z \in I_2$, or $r_0(z) < z$ for all $z \in I_2$. In the first case, the fixed
point between $I_1$ and $I_2$ is attracting on the left and repelling on the right,
and in the second case, the fixed point between $I_2$ and $I_3$ is repelling on the
left and attracting on the right. However, such fixed points can only occur for
parabolic linear fractional transformations with only one fixed point. This is
a contradiction, hence there cannot be three subintervals.

Thus, there are exactly two subintervals, and exactly one fixed point, which
must then be an attractor across all of $[0, 1]$. The same argument applies to $r_1$. □

Lemma 2.32. The fixed points satisfy $\eta_1 < \eta_0$.

Proof. By Lemma 2.29, $r_1(\eta_0) < r_0(\eta_0) = \eta_0$.

First consider the case $a + b > 1$. Applying Lemma 2.30 $k$ times gives
$r_1^{k+1}(\eta_0) < r_1^k(\eta_0)$, so the orbit of $\eta_0$ under $r_1$ is monotonically decreasing,
but it also converges to $\eta_1$ by Lemma 2.31, hence $\eta_1 < \eta_0$.

In the remaining case $a + b < 1$, suppose $\eta_1 \geq \eta_0$. Then by Lemma 2.30,
$\eta_1 = r_1(\eta_1) \leq r_1(\eta_0) < \eta_0$, which is a contradiction, hence $\eta_1 < \eta_0$. □
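Lemmas 2.29 to 2.32 are easy to probe numerically. The sketch below, with illustrative parameter values and helper names (`make_r`, `fixed_point`) that are not from the text, finds the two fixed points by plain iteration, relying on the global attraction asserted in Lemma 2.31, and lets one compare $\eta_1$ with $\eta_0$ in both the increasing ($a + b > 1$) and decreasing ($a + b < 1$) regimes.

```python
def make_r(a, b, p, q):
    """Return the r-functions r0, r1 of the special case."""
    s = a + b - 1

    def r0(z):
        return (s * z + 1 - b) / ((1 - p) * s * z + 1 - b + p * b)

    def r1(z):
        return q * (s * z + 1 - b) / ((1 - q) * (-s) * z + b + q - b * q)

    return r0, r1


def fixed_point(r, z=0.5, tol=1e-14, max_iter=100_000):
    """Iterate z -> r(z) until successive iterates agree to within tol."""
    for _ in range(max_iter):
        nxt = r(z)
        if abs(nxt - z) < tol:
            return nxt
        z = nxt
    return z


results = []
# One parameter set with a + b > 1 and one with a + b < 1 (both illustrative).
for (a, b, p, q) in [(0.7, 0.6, 0.3, 0.4), (0.2, 0.3, 0.3, 0.4)]:
    r0, r1 = make_r(a, b, p, q)
    eta0, eta1 = fixed_point(r0), fixed_point(r1)
    # Pointwise ordering r0 > r1 on a grid, as in Lemma 2.29.
    ordered = all(r0(k / 100) > r1(k / 100) for k in range(101))
    results.append((eta0, eta1, abs(r0(eta0) - eta0), abs(r1(eta1) - eta1), ordered))
```

Each entry of `results` holds the two fixed points, their residuals, and whether $r_0 > r_1$ held at every grid point.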



Proposition 2.33. The first exception to Theorem 2.24, Case 1, cannot
occur under a threshold policy.

Proof. Suppose $R_0 \subseteq A_0$, $R_1 \subseteq A_1$, and the policy is threshold. Since $0 \in R_0$
and $1 \in R_1$, it follows that every point of $R_0$ is less than every point of $R_1$.
Since $R_0$ and $R_1$ are the orbits of 0 and 1 under $r_0$ and $r_1$ respectively, by
Lemma 2.31, they have limit points $\eta_0$ and $\eta_1$ respectively, hence $\eta_0 \leq \eta_1$.
This contradicts Lemma 2.32, hence this cannot occur. □



The remaining exception is when the information chain is periodic with pe- 
riod 2, in which case the expected entropy oscillates between two limit points. 
The limiting expected entropy can still be defined in a sensible way, by tak- 
ing the average, minimum or maximum of the two limit points, depending 
on which is most appropriate for the situation. Thus, for threshold policies, 
it is possible to define optimality without exception. 

We conclude this section by writing down a closed form general expression 
for the limiting expected entropy. 



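Proposition 2.34. Under the conditions of Theorem 2.24, that is, in Case
0, the limiting expected entropy of a policy is given by

\[ H_\infty = \frac{C^{(0)}(H) + C^{(1)}(H)}{C^{(0)}(1) + C^{(1)}(1)}, \tag{2.37} \]

where, for $i \in \{0, 1\}$:

• $H(z) = -z \log z - (1 - z) \log(1 - z)$ is the entropy function and $1(z) = 1$
is the constant function with value 1;

• $r(z) = r_0(z) 1_{A_0}(z) + r_1(z) 1_{A_1}(z)$ and $\alpha(z) = \alpha_0(z) 1_{A_0}(z) + \alpha_1(z) 1_{A_1}(z)$
are the combined r-function and combined α-function respectively, with
$r_0$, $r_1$, $\alpha_0$ and $\alpha_1$ defined as in (2.36);

• $z_k^{(i)} \in \mathcal{P}(S) = [0, 1]$ are defined by the recursion

\[ z_0^{(i)} = i, \qquad z_{k+1}^{(i)} = r(z_k^{(i)}); \tag{2.38} \]

• $c_k^{(i)} \in \mathbb{R}$ are defined by the recursion

\[ c_0^{(i)} = 1, \qquad c_{k+1}^{(i)} = \alpha(z_k^{(i)})\, c_k^{(i)}; \tag{2.39} \]

• $C^{(i)} : C(\mathcal{P}(S)) \to \mathbb{R}$ is the linear functional

\[
C^{(i)}(f) = \Bigg( \sum_{k \in \mathbb{Z}_+} c_k^{(1-i)} \big(1 - \alpha(z_k^{(1-i)})\big) 1_{\{z_k^{(1-i)} \in A_i\}} \Bigg)
\Bigg( \sum_{k \in \mathbb{Z}_+} c_k^{(i)} f(z_k^{(i)}) \Bigg). \tag{2.40}
\]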

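Proof. By Theorem 2.24, a unique invariant probability measure $Z_\infty$ exists,
so it suffices to show that

\[ Z_\infty = \frac{C^{(0)}(\delta) + C^{(1)}(\delta)}{C^{(0)}(1) + C^{(1)}(1)}, \tag{2.41} \]

where $\delta : \mathcal{P}(S) \to \mathcal{M}(\mathcal{P}(S))$ is the Dirac measure, and $C^{(i)}$ is extended in
the obvious way to a linear functional $C(\mathcal{P}(S), \mathcal{M}(\mathcal{P}(S))) \to \mathcal{M}(\mathcal{P}(S))$.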



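By Proposition 2.21, the invariant measure is supported in the combined
orbit set $R$, and $z_k^{(i)}$ is the only point which can make a one-step transition
under the restricted information chain to $z_{k+1}^{(i)}$, with probability $\alpha(z_k^{(i)})$.
Since the masses at $z_k^{(i)}$ and $z_{k+1}^{(i)}$ are invariant, we must have

\[ \mathbb{P}(Z_\infty = z_{k+1}^{(i)}) = \alpha(z_k^{(i)})\, \mathbb{P}(Z_\infty = z_k^{(i)}). \tag{2.42} \]

It then follows that for some constants $B^{(0)}$ and $B^{(1)}$,

\[ Z_\infty = B^{(0)} \sum_{k \in \mathbb{Z}_+} c_k^{(0)} \delta(z_k^{(0)}) + B^{(1)} \sum_{k \in \mathbb{Z}_+} c_k^{(1)} \delta(z_k^{(1)}). \tag{2.43} \]

The mass at $z_k^{(i)}$ is $B^{(i)} c_k^{(i)}$, which makes a transition under $F$ to 0 in
one step with probability $(1 - \alpha_0(z_k^{(i)})) 1_{\{z_k^{(i)} \in A_0\}}$. Since $Z_\infty$ is an invariant
measure, mass is conserved at $z_0^{(0)} = 0$, hence

\[ \sum_{j \in \{0,1\}} B^{(j)} \sum_{k \in \mathbb{Z}_+} c_k^{(j)} \big(1 - \alpha(z_k^{(j)})\big) 1_{\{z_k^{(j)} \in A_0\}} = B^{(0)}. \tag{2.44} \]

Since $c_k^{(i)} \to 0$ as $k \to \infty$, by telescoping series,

\[ \sum_{k \in \mathbb{Z}_+} c_k^{(i)} \big(1 - \alpha(z_k^{(i)})\big) = \sum_{k \in \mathbb{Z}_+} \big(c_k^{(i)} - c_{k+1}^{(i)}\big) = c_0^{(i)} = 1. \tag{2.45} \]

Multiplying the right hand side of (2.44) by (2.45) with $i = 0$, then collecting
coefficients of $B^{(0)}$ and $B^{(1)}$, yields

\[
B^{(0)} \sum_{k \in \mathbb{Z}_+} c_k^{(0)} \big(1 - \alpha(z_k^{(0)})\big) 1_{\{z_k^{(0)} \in A_1\}}
= B^{(1)} \sum_{k \in \mathbb{Z}_+} c_k^{(1)} \big(1 - \alpha(z_k^{(1)})\big) 1_{\{z_k^{(1)} \in A_0\}}. \tag{2.46}
\]

Note that conservation of mass at $z_0^{(1)} = 1$ is now automatic, since mass is
conserved on all of $R$ and at every other point in $R$. The second equation
comes from requiring $Z_\infty$ to have total mass 1, that is,

\[ B^{(0)} \sum_{k \in \mathbb{Z}_+} c_k^{(0)} + B^{(1)} \sum_{k \in \mathbb{Z}_+} c_k^{(1)} = 1. \tag{2.47} \]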

The solution to (2.46) and (2.47) is exactly the required result. We note that
the denominator is zero only when $z_k^{(0)} \in A_0$ and $z_k^{(1)} \in A_1$ for all $k \in \mathbb{Z}_+$,
which is exactly the excluded case $R_0 \subseteq A_0$ and $R_1 \subseteq A_1$. □
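For readers who want the last step spelled out, here is a short worked solution of the two linear equations. The shorthand $C_i$, $I_i$, $H_i$ is introduced only for this sketch (it matches the notation used later for the computations): $I_i = \sum_k c_k^{(i)}$, $H_i = \sum_k c_k^{(i)} H(z_k^{(i)})$ and $C_i = \sum_k c_k^{(i)} (1 - \alpha(z_k^{(i)})) 1_{\{z_k^{(i)} \in A_{1-i}\}}$.

```latex
% Conservation of mass (2.46) and normalisation (2.47) in shorthand:
\begin{align*}
  B^{(0)} C_0 &= B^{(1)} C_1, &
  B^{(0)} I_0 + B^{(1)} I_1 &= 1.
\end{align*}
% The first equation gives B^{(0)} : B^{(1)} = C_1 : C_0; normalising,
\begin{align*}
  B^{(0)} &= \frac{C_1}{C_1 I_0 + C_0 I_1}, &
  B^{(1)} &= \frac{C_0}{C_1 I_0 + C_0 I_1}.
\end{align*}
% Integrating the entropy against the invariant measure (2.43):
\begin{equation*}
  H_\infty = B^{(0)} H_0 + B^{(1)} H_1
           = \frac{C_1 H_0 + C_0 H_1}{C_1 I_0 + C_0 I_1}.
\end{equation*}
```

In this shorthand $C^{(0)}(f) = C_1 \sum_k c_k^{(0)} f(z_k^{(0)})$ and $C^{(1)}(f) = C_0 \sum_k c_k^{(1)} f(z_k^{(1)})$, so the last line is the same expression as (2.37).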



This proof can be generalised easily to the case of more than two observation
processes, as long as each one is anchored with only two states.

3 Computational Results

3.1 Limiting Entropy

We now present computational results, again in the special case, which will
illustrate the nature of the optimal policy in the case of minimising limiting
expected entropy. Since any such discussion must cover how best to calculate
the limiting expected entropy given a hidden Markov model and a policy, this
is a natural place to start.

The simplest approach is to use the formula in Proposition 2.34, noting
that each of $C^{(0)}(H)$, $C^{(1)}(H)$, $C^{(0)}(1)$ and $C^{(1)}(1)$ is a product of two infi-
nite series, each of which is bounded by a geometric series and hence well-
approximated by truncation. Specifically, we write the limiting expected
entropy as

\[ H_\infty = \frac{C_1 H_0 + C_0 H_1}{C_1 I_0 + C_0 I_1}, \]

where, for $i \in \{0, 1\}$:

• $H_i = \sum_{k \in \mathbb{Z}_+} c_k^{(i)} H(z_k^{(i)})$;

• $I_i = \sum_{k \in \mathbb{Z}_+} c_k^{(i)}$;

• $C_i = \sum_{k \in \mathbb{Z}_+} c_k^{(i)} \big(1 - \alpha(z_k^{(i)})\big) 1_{\{z_k^{(i)} \in A_{1-i}\}}$.

We can simplify the calculation slightly by recursively updating $c_k^{(i)}$ and $z_k^{(i)}$,
storing them as real-valued variables $c_i$ and $z_i$.

Algorithm 3.1. Estimation of limiting expected entropy.

1. Define as functions the entropy $H(z) = -z \log z - (1 - z) \log(1 - z)$,
the mixed r-function $r(z) = 1_{A_0}(z) r_0(z) + 1_{A_1}(z) r_1(z)$ and the mixed
α-function $\alpha(z) = 1_{A_0}(z) \alpha_0(z) + 1_{A_1}(z) \alpha_1(z)$;

2. Pick a large number $N$;

3. Initialise variables $C_0$, $C_1$, $H_0$, $H_1$, $I_0$ and $I_1$ to 0, $c_0$ and $c_1$ to 1, $z_0$ to
0 and $z_1$ to 1;

4. Repeat the following $N + 1$ times:

(a) Add $c_0$ to $I_0$ and $c_1$ to $I_1$;

(b) Add $c_0 H(z_0)$ to $H_0$ and $c_1 H(z_1)$ to $H_1$;

(c) If $z_0 \in A_1$, then add $c_0(1 - \alpha_1(z_0))$ to $C_0$, and if $z_1 \in A_0$, then add
$c_1(1 - \alpha_0(z_1))$ to $C_1$;

(d) Multiply $c_0$ by $\alpha(z_0)$ and $c_1$ by $\alpha(z_1)$, so that the weights follow the
recursion (2.39);

(e) Let $z_0 = r(z_0)$ and $z_1 = r(z_1)$;

5. The limiting expected entropy $H_\infty$ is estimated by

\[ \hat{H}_\infty(N) = \frac{C_1 H_0 + C_0 H_1}{C_1 I_0 + C_0 I_1}. \]
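The algorithm translates almost line for line into code. The following is a minimal sketch rather than a reference implementation: the function names, the representation of a policy as a function returning 0 or 1, and the example parameter values are all assumptions made for illustration. Each weight is multiplied by α evaluated at the current point before that point is advanced, so that the weights follow the recursion (2.39).

```python
import math


def entropy(z):
    """Binary entropy H(z) in nats, with H(0) = H(1) = 0."""
    if z <= 0.0 or z >= 1.0:
        return 0.0
    return -z * math.log(z) - (1 - z) * math.log(1 - z)


def limiting_entropy(a, b, p, q, policy, n_iter):
    """Truncated estimate of the limiting expected entropy.

    policy(z) returns 0 or 1, the observation process used at state z.
    Returns (estimate, Q), where Q is the truncated denominator.
    """
    s = a + b - 1
    alpha = (
        lambda z: (1 - p) * s * z + 1 - b + p * b,
        lambda z: (1 - q) * (-s) * z + b + q - b * q,
    )
    r = (
        lambda z: (s * z + 1 - b) / alpha[0](z),
        lambda z: q * (s * z + 1 - b) / alpha[1](z),
    )
    z = [0.0, 1.0]              # current orbit points, started at 0 and 1
    c = [1.0, 1.0]              # current weights c_k
    C, H, I = [0.0, 0.0], [0.0, 0.0], [0.0, 0.0]
    for _ in range(n_iter + 1):
        for i in (0, 1):
            g = policy(z[i])
            a_here = alpha[g](z[i])
            I[i] += c[i]
            H[i] += c[i] * entropy(z[i])
            if g == 1 - i:      # this point can jump to the other anchor
                C[i] += c[i] * (1 - a_here)
            c[i] *= a_here      # weight update uses alpha at the current point
            z[i] = r[g](z[i])   # then the point is advanced
    Q = C[1] * I[0] + C[0] * I[1]
    return (C[1] * H[0] + C[0] * H[1]) / Q, Q


def threshold_policy(theta):
    """Illustrative threshold policy: process 0 below theta, process 1 otherwise."""
    return lambda z: 0 if z < theta else 1


est_200, q_200 = limiting_entropy(0.7, 0.7, 0.5, 0.5, threshold_policy(0.5), 200)
est_400, q_400 = limiting_entropy(0.7, 0.7, 0.5, 0.5, threshold_policy(0.5), 400)
```

Because the estimate is a weighted average of entropy values, it always lies between 0 and $\log 2$, which gives a cheap sanity check on any implementation.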



Proposition 3.2. The estimate $\hat{H}_\infty(N)$ satisfies the error bound

\[ |\hat{H}_\infty(N) - H_\infty| \leq \frac{16\, \bar{\alpha}^N}{(1 - \bar{\alpha})^4\, Q(N)^2}, \]

where $Q(N) = \hat{C}_1 \hat{I}_0 + \hat{C}_0 \hat{I}_1$ is the denominator of $\hat{H}_\infty(N)$, hats denote the series
truncated after $N + 1$ terms as computed by Algorithm 3.1, and

\[ \bar{\alpha} = \sup_{i,z} \alpha_i(z) = 1 - \min\{ b(1 - p),\ (1 - a)(1 - p),\ (1 - b)(1 - q),\ a(1 - q) \}. \]

Proof. The formula for $\bar{\alpha}$ follows from Definition 2.14, since the α-functions
are linear and hence their maxima occur at the endpoints $z = 0$ or $z = 1$.
Note that $\bar{\alpha} < 1$ follows from the requirement that $a, b, p, q \in (0, 1)$. Since
$Q(N)$ is monotonic increasing, this implies that the error bound is finite and
vanishes as $N \to \infty$.



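Since each series has non-negative summands, each truncated series is smaller
than the untruncated series, hence

\[
\begin{aligned}
\left| \hat{H}_\infty(N) - \frac{\hat{C}_1 \hat{H}_0 + \hat{C}_0 \hat{H}_1}{C_1 I_0 + C_0 I_1} \right|
&= \frac{(\hat{C}_1 \hat{H}_0 + \hat{C}_0 \hat{H}_1)(C_1 I_0 + C_0 I_1 - \hat{C}_1 \hat{I}_0 - \hat{C}_0 \hat{I}_1)}
        {(\hat{C}_1 \hat{I}_0 + \hat{C}_0 \hat{I}_1)(C_1 I_0 + C_0 I_1)} \\
&\leq \frac{\hat{C}_1 \hat{H}_0 + \hat{C}_0 \hat{H}_1}{(\hat{C}_1 \hat{I}_0 + \hat{C}_0 \hat{I}_1)^2}
      \Big( (C_1 I_0 - \hat{C}_1 \hat{I}_0) + (C_0 I_1 - \hat{C}_0 \hat{I}_1) \Big).
\end{aligned}
\]

The $k$th summand in each series is bounded by $c_k^{(i)} \leq \bar{\alpha}^k$, so each full series is
at most $1/(1 - \bar{\alpha})$ and each tail is at most $\bar{\alpha}^N/(1 - \bar{\alpha})$, hence

\[
\begin{aligned}
C_1 I_0 - \hat{C}_1 \hat{I}_0
&= I_0 (C_1 - \hat{C}_1) + C_1 (I_0 - \hat{I}_0) - (C_1 - \hat{C}_1)(I_0 - \hat{I}_0) \\
&\leq \frac{1}{1 - \bar{\alpha}} \cdot \frac{\bar{\alpha}^N}{1 - \bar{\alpha}}
    + \frac{1}{1 - \bar{\alpha}} \cdot \frac{\bar{\alpha}^N}{1 - \bar{\alpha}}
 = \frac{2 \bar{\alpha}^N}{(1 - \bar{\alpha})^2}.
\end{aligned}
\]

The same bound holds for $C_0 I_1 - \hat{C}_0 \hat{I}_1$, and the numerator satisfies
$\hat{C}_1 \hat{H}_0 + \hat{C}_0 \hat{H}_1 \leq 2/(1 - \bar{\alpha})^2$, hence

\[
\left| \hat{H}_\infty(N) - \frac{\hat{C}_1 \hat{H}_0 + \hat{C}_0 \hat{H}_1}{C_1 I_0 + C_0 I_1} \right|
\leq \frac{8 \bar{\alpha}^N}{(1 - \bar{\alpha})^4\, Q(N)^2}.
\]

Similarly, since $Q(N) = \hat{C}_1 \hat{I}_0 + \hat{C}_0 \hat{I}_1 \leq 2/(1 - \bar{\alpha})^2$,

\[
\begin{aligned}
\left| H_\infty - \frac{\hat{C}_1 \hat{H}_0 + \hat{C}_0 \hat{H}_1}{C_1 I_0 + C_0 I_1} \right|
&= \frac{(C_1 H_0 - \hat{C}_1 \hat{H}_0) + (C_0 H_1 - \hat{C}_0 \hat{H}_1)}{C_1 I_0 + C_0 I_1} \\
&\leq \frac{4 \bar{\alpha}^N}{(1 - \bar{\alpha})^2\, Q(N)}
 \leq \frac{8 \bar{\alpha}^N}{(1 - \bar{\alpha})^4\, Q(N)^2}.
\end{aligned}
\]

Combining via the triangle inequality gives the required bound. □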



Note that this bound depends only on the quantities $\bar{\alpha}$ and $Q(N)$, which are
easily calculated. In particular, this allows us to prescribe a precision and
calculate the limiting expected entropy to within that precision by running
the algorithm with unbounded $N$ and terminating when the error bound reaches
the desired precision. Furthermore, since $Q(N)$ appears only in the denom-
inator and grows monotonically with $N$, we can replace it with $Q(N_0)$ for
some small, fixed $N_0$ to calculate a priori a sufficient number of steps for any
given precision.
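The a priori step count can be read off by taking logarithms of the bound. The helper below is an illustrative sketch, not part of the original method: the function names are hypothetical, and it assumes the bound has the form $16\bar{\alpha}^N / ((1-\bar{\alpha})^4 Q^2)$, with $Q$ any lower bound on $Q(N)$ such as $Q(N_0)$.

```python
import math


def alpha_bar(a, b, p, q):
    """Supremum of the alpha-functions, attained at an endpoint z = 0 or z = 1."""
    return 1 - min(b * (1 - p), (1 - a) * (1 - p), (1 - b) * (1 - q), a * (1 - q))


def error_bound(abar, Q, N):
    """Assumed form of the bound: 16 * abar^N / ((1 - abar)^4 * Q^2)."""
    return 16 * abar ** N / ((1 - abar) ** 4 * Q ** 2)


def iterations_for(abar, Q, eps):
    """Smallest N with error_bound(abar, Q, N) <= eps, via logarithms."""
    needed = (math.log(16) - 4 * math.log(1 - abar)
              - 2 * math.log(Q) - math.log(eps)) / (-math.log(abar))
    return max(0, math.ceil(needed))


# Illustrative values: a = b = 0.7, p = q = 0.5 and a lower bound Q = 0.5.
abar = alpha_bar(0.7, 0.7, 0.5, 0.5)
n_needed = iterations_for(abar, 0.5, 1e-10)
```

Since the true $Q(N)$ only grows with $N$, a step count computed from $Q(N_0)$ is conservative.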

Example 3.3. In later simulations, we will pick $a, b, p, q \in [0.025, 0.975]$,
so that $\bar{\alpha} \leq 0.999375$. This gives an error bound of

\[ |\hat{H}_\infty(N) - H_\infty| \leq (16 \times 400^4)\, Q(N)^{-2}\, (0.999375)^N. \]

While the constant appears daunting at first glance, solving for a prescribed
error of $10^{-10}$ gives

\[ \log 16 + 4 \log 400 - 2 \log Q(N) + N \log 0.999375 \leq -10 \log 10. \]

Hence, we require

\[ N \geq 79598 + 3199\, |\log Q(N)|. \]

For any realistic value of $Q(N)$, this is easily within computational bounds,
as each iteration requires at most 36 arithmetic operations, 2 calls to the
policy function, and 4 calls to the logarithm function.

An alternative approach to estimating limiting expected entropy would be
to simulate observation of the hidden Markov model under the given policy.
The major drawback of this method is that it requires working with the
information state $Z_t$, which takes values in $\mathcal{P}(\mathcal{P}(S)) = \mathcal{P}([0, 1])$, the set of
probability measures on the unit interval, which is an infinite dimensional
space.

One possibility is to divide [0, 1] into subintervals and treat each subinterval
as a discrete atom, but this produces a very imprecise result. Even using 
an unrealistically small subinterval width of 10~^, the entropy function has 
a variation of over 10~^ across the first subinterval, restricting the error 
bound to this order of magnitude regardless of the number of iterations. 



In comparison, Example 3.3 shows that the direct estimation method has
greatly superior performance.

An improvement is to use the fact that the limiting distribution $Z_\infty$ is dis-
crete, and store a list of pairs containing the locations and masses of discrete
points. Since any starting point moves to either 0 or 1 in one step, at the
$N$th iteration, the list of points must contain at least the first $N$ points in
the orbit of either 0 or 1. Each such point requires a separate calculation at
each iteration, and thus the number of computations is $O(N^2)$ rather than
$O(N)$ as for Algorithm 3.1.

Since the number of iterations $N$ corresponds to the last point in the orbit
of 0 or 1 which is represented, for any given $N$, this method differs from
the direct computation method only in the masses on these points, thus we
would expect the relationship between precision and number of iterations to
be similar. Since the simulation method has quadratically growing number
of computations, this would suggest that it is slower than the direct compu-
tation method, and indeed, this is also indicated by empirical trials.

We will use the direct computation method of estimating limiting expected
entropy for all of our results.

3.2 Optimal Threshold Policy

The problem of finding the policy which minimises limiting expected entropy
is made much easier by restricting the search space to the set of threshold
policies, as these can be represented by a single number representing the
threshold, and a sign representing which observation process is used on which
side of the threshold.

The simplest approach is to pick a collection of test thresholds uniformly
in $[0, 1]$, either deterministically or randomly, and test the limiting expected
entropy at these thresholds, picking the threshold with minimal entropy as
the optimal threshold policy. However, this method is extremely inefficient.
Proposition 2.21 shows that the policy only matters on the countable set
$R \subset [0, 1]$, so moving the threshold does not change the policy as long as it
does not move past a point in $R$.



As shown in Figures 3^-3/7, points in R tend to be quite far apart, and thus 
the naive approach will cause a large number of equivalent policies to be 
tested. On the other hand, points in R close to the accumulation points are 
closely spaced, so even with a very fine uniform subset, some policies will be 
missed when the spacing between points in R becomes less than the spacing 
between test points. 

A better way is to decide on a direction in which to move the threshold, and
select the test point exactly at the next point in the previous realisation of
$R$ in the chosen direction, so that every threshold in between the previous
point and the current point gives a policy equivalent to the previous policy.
This ensures that each equivalence class of policies is tested exactly once,
thus avoiding the problem with the naive method.

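The idea of stepping the threshold from one orbit point to the next can be sketched as follows. This is an illustration under explicit assumptions, not the authors' procedure: the policy is taken to use process 0 on $[0, \theta]$ and process 1 on $(\theta, 1]$, only the first $n + 1$ points of each orbit are considered, the threshold is moved to the right, and the function names are hypothetical.

```python
def orbit_points(a, b, p, q, theta, n):
    """First n + 1 points of the orbits of 0 and 1 under the combined r-function.

    Assumed policy: process 0 if z <= theta, process 1 otherwise.
    """
    s = a + b - 1

    def step(z):
        predicted = s * z + 1 - b
        if z <= theta:
            return predicted / ((1 - p) * s * z + 1 - b + p * b)
        return q * predicted / ((1 - q) * (-s) * z + b + q - b * q)

    points = set()
    for z in (0.0, 1.0):
        for _ in range(n + 1):
            points.add(z)
            z = step(z)
    return sorted(points)


def next_threshold(a, b, p, q, theta, n):
    """Smallest orbit point strictly to the right of theta, or None if there is none.

    Any threshold in between leaves the policy unchanged on the orbit points.
    """
    above = [z for z in orbit_points(a, b, p, q, theta, n) if z > theta]
    return min(above) if above else None


# Illustrative parameters and starting threshold.
params = (0.7, 0.6, 0.3, 0.4)
points = orbit_points(*params, 0.3, 50)
candidate = next_threshold(*params, 0.3, 50)
```

Repeating this step, and recomputing the orbit after each move since the orbit itself depends on the policy, visits one threshold per equivalence class among those distinguishable from the first $n + 1$ orbit points.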


However, a new problem is introduced in that the set of test points depends
on the iteration number $N$, which determines the number of points of $R$ that
are considered. This creates a circular dependence, in that the choice of $N$
depends on the desired error bound, the error bound depends on the policy,
and the set of policies to be tested depends on $N$. We can avoid this problem
by adapting Proposition 3.2 to a uniform error bound across all threshold
policies.

Proposition 3.4. For a threshold policy and $N \geq L$, the error is

\[ |\hat{H}_\infty(N) - H_\infty| \leq \frac{16\, \bar{\alpha}^N}{\underline{\alpha}^{2L} (1 - \bar{\alpha})^6}, \]

where $L$ is the smallest integer such that $r_0^L(0) > r_1^L(1)$, and

\[ \underline{\alpha} = \inf_{i,z} \alpha_i(z) = 1 - \max\{ b(1 - p),\ (1 - a)(1 - p),\ (1 - b)(1 - q),\ a(1 - q) \}. \]



Proof. First note that L exists, since by Lemma [2.32[ iterations of tq and ri 
converge to rjo > rji respectively. 



Using Proposition 3.2, it suffices to prove that for > L, 

Q{N) = CJo + Coh > - «). 

It is not possible for $\{0, r(0), \dots, r^L(0)\} \subset A_0$ and $\{1, r(1), \dots, r^L(1)\} \subset A_1$,
since this would mean $r^L(0) = r_0^L(0)$ and $r^L(1) = r_1^L(1)$, which gives the
ordering $0 < r^L(1) < r^L(0) < 1$, but $A_0$ and $A_1$ are intervals for a threshold
policy.

Hence, either $z_\ell^{(0)} = r^\ell(0) \in A_1$ or $z_\ell^{(1)} = r^\ell(1) \in A_0$ for some $\ell \le L < N$. If
zf G Ai, then Co > cf^(l - ai{zf ^)) > a^{l - a). Since h > c[,^^ = 1, this 
gives $Q(N) \ge \alpha^L(1 - \alpha)$, as required. A similar argument holds in the case
$z_\ell^{(1)} \in A_0$. □
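
As a rough numerical illustration of Proposition 3.4, the following sketch evaluates the uniform bound and inverts it to obtain a sufficient number of iterations for a prescribed error. The function names are hypothetical, the third term of the maximum is read as (1-b)(1-q), and the value of L is assumed to have been found separately by iterating r_0 and r_1.

```python
import math


def uniform_alpha(a: float, b: float, p: float, q: float) -> float:
    """alpha = 1 - max{b(1-p), (1-a)(1-p), (1-b)(1-q), a(1-q)}.

    The third term (1-b)(1-q) is an assumed reading of the formula.
    """
    worst = max(b * (1 - p), (1 - a) * (1 - p), (1 - b) * (1 - q), a * (1 - q))
    return 1.0 - worst


def uniform_error_bound(alpha: float, L: int, N: int) -> float:
    """Right-hand side 16 alpha^N / (alpha^(2L) (1 - alpha)^6) of the bound."""
    if N <= L:
        raise ValueError("the bound is only stated for N > L")
    return 16.0 * alpha ** N / (alpha ** (2 * L) * (1.0 - alpha) ** 6)


def iterations_for_error(E: float, alpha: float, L: int) -> int:
    """Smallest integer N > L for which the bound is at most E."""
    n = (math.log(E) - math.log(16) + 2 * L * math.log(alpha)
         + 6 * math.log(1 - alpha)) / math.log(alpha)
    return max(L + 1, math.ceil(n))
```

Rounding up and enforcing N > L are choices made for this sketch; the bound itself does not prescribe how a non-integer value should be handled.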

The existence of this uniform bound for $Q(N)$ in the threshold case is closely
related to Proposition 2.33, which states that the exceptional Case 1, where
$R_0 \subset A_0$ and $R_1 \subset A_1$, cannot occur in a threshold policy. In this exceptional
case, Proposition 2.34 does not hold, as the denominator is zero, and hence
$Q(N) = 0$ for all $N$. The fact that this cannot occur in a threshold policy is
the key ingredient of this uniform bound.

Now that we have an error bound which does not depend on the policy, we can 
determine a uniform number of iterations that will suffice for estimating 
the limiting expected entropy for all threshold policies. This reduces the 
search space to a finite one, as each point in the orbits of 0 and 1 must be in
one of the two policy regions; hence, there are at most $2^{2N+2}$ policies. Most
of these will not be threshold policies, but since orbit points need not be 
ordered, there is no obvious bound on the number of threshold policies that 
need to be checked. Simulation results later in this section will show that in 
most cases, the number of such policies is small enough to be computationally 
feasible. 

Definition 3.5. The Orientation of a threshold policy is the pair consisting of
whether $A_0$ is to the left or right of $A_1$, and whether the threshold is included
in the left or right interval. Let $[A_0)[A_1]$, $[A_0](A_1]$, $[A_1)[A_0]$ and $[A_1](A_0]$
denote the four possibilities, with the square bracket indicating inclusion of the
threshold and the round bracket indicating exclusion.

Our strategy for simplifying the space of threshold policies that need to
be considered is to note that the policy only matters on $R$, the support
of the invariant measure. Although $R$ depends on the policy, for a given
orientation and threshold $t$, any policy with the same orientation and some
other threshold $t'$ such that no points of $R$ lie between $t$ and $t'$ is an equivalent
policy, in the sense that the invariant measure is the same, since no mass
exists in the region where the two policies differ.

Thus, for each orientation, we can begin with $t = 1^+$, and at each iteration,
move the threshold left past the next point in $R$, since every threshold in
between is equivalent to the previous threshold. Although $R$ changes at
each step, this process must terminate in finite time since we already showed
that there are only finitely many policies for any given $N$, and by testing
equivalence classes of policies only once, it is likely that far fewer steps
are required than the $2^{2N+2}$ bound.

Furthermore, since $R$ is a discrete set, every threshold policy has an interval
of equivalent threshold policies, so we can assume without loss of generality
that the threshold is contained in the interval to the right, that is, only test
the orientations $[A_0)[A_1]$ and $[A_1)[A_0]$.

Algorithm 3.6. Finding the optimal threshold policy.

1. Find $L$, the smallest integer such that $r_0^L(0) > r_1^L(1)$, by repeated
application of $r_0$ and $r_1$ to 0 and 1 respectively;

2. Prescribe an error $E$ and determine the number of iterations

$$N = \frac{\log E - \log 16 + 2L \log\alpha + 6\log(1 - \alpha)}{\log\alpha};$$

3. Start with the policy $A_0 = [0, t)$ and $A_1 = [t, 1]$ with $t = 1^+$, that is,
$A_0$ is the whole interval and $A_1$ is empty but considered as being to the
right of $A_0$, and loop until $t = 0$:

(a) Run Algorithm 3.1, and let the next value of $t$ be the greatest $z_k^{(i)}$
which is strictly less than the previous value of $t$;

(b) If the entropy is less than the minimum entropy for any policy
encountered so far, record it as the new minimum;

4. Repeat with $A_0$ considered to be on the right of $A_1$.
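
The threshold sweep in step 3 can be sketched as follows for a single orientation. This is only an outline under stated assumptions: `evaluate` is a hypothetical stand-in for Algorithm 3.1 that returns the entropy and the finitely many computed points of R for a given threshold, infinity plays the role of t = 1+, and the sketch also evaluates the final threshold t = 0 before stopping.

```python
import math
from typing import Callable, Iterable, Optional, Tuple

# evaluate(t) is a hypothetical stand-in for Algorithm 3.1 under one fixed
# orientation: it returns (entropy, computed points of R) for threshold t.
Evaluate = Callable[[float], Tuple[float, Iterable[float]]]


def sweep_thresholds(evaluate: Evaluate) -> Tuple[float, Optional[float]]:
    """Visit one threshold per equivalence class, moving left from t = 1+."""
    t = math.inf  # stands for 1+: the threshold lies to the right of every point
    best_entropy, best_t = math.inf, None
    while True:
        entropy, points = evaluate(t)
        if entropy < best_entropy:
            best_entropy, best_t = entropy, t
        # Next threshold: the greatest point strictly below the current one.
        below = [z for z in points if z < t]
        if not below:
            return best_entropy, best_t
        t = max(below)
```

Calling the function once per orientation and keeping the smaller of the two results would correspond to step 4.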

We now calculate the location of the optimal threshold for a range of values
of the parameters a, b, p and q. The results of this calculation, which will
be presented in Figure 3.8, are the primary content of this section, as they
give an empirical description of the optimal threshold policy. Since we have
not been able to prove optimality analytically, except in the symmetric case
of Proposition 3.8, this empirical description will provide our most valuable
insight into the problem.

In order to facilitate understanding, we will define six classes of threshold
policies, depending on the location and orientation of the threshold.
Diagrams of the invariant measure under each of these classes, presented in
Figures 3.2-3.7, will demonstrate that these classes of thresholds are
qualitatively different, which will further manifest itself in the next section.

We partition the interval into three subintervals with endpoints at the
attractors $\eta_0$ and $\eta_1$, noting that this produces three regions which consist entirely
of a single equivalence class of policy. We also assign colours to these regions
to accommodate the data presented in Figure 3.8.

Definition 3.7. The six Threshold Regions are defined by partitioning
the set of all threshold policies first by whether the orientation is $[A_0)[A_1]$ or
$[A_1)[A_0]$, then by the location of the threshold $t$ in relation to the
accumulation points $\eta_1$ and $\eta_0$, with inclusion of endpoints defined more precisely
overleaf. We number them using Roman numerals I through VI.

Note that if either $t \le 0$ or $t > 1$, then either $A_0$ or $A_1$ occupies the whole
interval, depending on the orientation. In particular, $[A_0)[A_1]$ with $t = 0$
is equivalent to $[A_1)[A_0]$ with $t = 1^+$, since in both cases $A_1$ is the whole
interval, and similarly $[A_0)[A_1]$ with $t = 1^+$ is equivalent to $[A_1)[A_0]$ with
$t = 0$.

Thus, the space of all possible threshold policies consists of two disjoint
intervals $[0, 1^+]$, each of whose endpoints is identified with an endpoint of the
other interval, which is topologically equivalent to a circle. To be technically
correct, we note that identifying 0 and $1^+$ does not present a problem, since
we can simply extend the interval to $[0, 1 + \epsilon]$ for some small $\epsilon > 0$, and
identify 0 and $1 + \epsilon$ instead. While this results in a subinterval of the circle
corresponding to the same policy, this does not add additional complexity
since every threshold policy has an interval of equivalent policies.



This topology is illustrated overleaf in Figure 3.1, followed by precise
definitions and examples of the six threshold regions in Figures 3.2-3.7, and finally
our computational results in Figure 3.8.

Figure 3.1: Space of all threshold policies. The top line represents
the orientation $[A_0)[A_1]$, while the bottom line represents
the orientation $[A_1)[A_0]$. The right end of the top line and the left
end of the bottom line are both the policy $A_0 = [0, 1]$ and $A_1 = \emptyset$,
so we can paste them together, and similarly for the left end of
the top line and the right end of the bottom line. Hence, we see
that the set of threshold policies is topologically a circle.

Region I (represented by red in Figure 3.8): $[A_0)[A_1]$ with $t < \eta_1$ or $[A_1)[A_0]$
with $t > 1$. When a + b > 1, every policy here is equivalent to the all-$A_1$
policy, as the mass in the orbit of 0 approaches 0.




Figure 3.2: Invariant measure for the unique Region I policy 
with a = 0.8, b = 0.3, p = 0.5, q = 0.3. Entropy is 0.3145. Under 
the evolution function, any mass eventually enters the grey region 
since it contains both accumulation points in its interior, after 
which it cannot escape, hence in the limit, there is zero mass in 
the orbit of 0, and the policy is equivalent to the all-$A_1$ policy.

Region II (yellow): $[A_0)[A_1]$ with $\eta_1 < t < \eta_0$. This is the most difficult
to understand of the threshold policies, as the orbits do not converge to the
accumulation points $\eta_0$ and $\eta_1$, but rather oscillate around the threshold $t$.




Figure 3.3: Invariant measure for a typical Region II policy with 
a = 0.8, b = 0.3, p = 0.5, q = 0.3. Entropy is 0.3251. Note that
the masses do not converge to the accumulation points.

Region III (green): $[A_0)[A_1]$ with $t > \eta_0$ or $[A_1)[A_0]$ with $t \le 0$. When
a + b > 1, every policy here is equivalent to the all-$A_0$ policy, since the mass
in the orbit of 1 approaches 0.







Figure 3.4: Invariant measure for the unique Region III policy 
with a = 0.8, b = 0.3, p = 0.5, q = 0.3. Entropy is 0.3337. Under 

the evolution function, any mass eventually enters the white region 
since it contains both accumulation points in its interior, after 
which it cannot escape, hence in the limit, there is zero mass in 
the orbit of 1, and the policy is equivalent to the all-$A_0$ policy.

Region IV (cyan): $[A_1)[A_0]$ with $0 < t < \eta_1$. For a policy in this region, the
first finitely many points of the orbit of 0 belong to $A_1$, while every other
point lies in $A_0$. Note that $t \le 0$ is included in Region III.




Figure 3.5: Invariant measure for a typical Region IV policy 
with a = 0.8, b = 0.3, p = 0.5, q = 0.3. Entropy is 0.3275. The 

orbit of 1 approaches $\eta_0$, while the orbit of 0 follows an approach
sequence to $\eta_1$ for a finite number of steps (in this case 2 steps)
before also approaching $\eta_0$.

Region V (blue): $[A_1)[A_0]$ with $\eta_1 < t < \eta_0$. When a + b > 1, every policy
here is equivalent to the policy $R_0 \subset A_1$ and $R_1 \subset A_0$.







Figure 3.6: Invariant measure for the unique Region V policy
with a = 0.8, b = 0.3, p = 0.5, q = 0.3. Entropy is 0.3265. Note
that the orbit of 0 converges to $\eta_1$ while the orbit of 1 converges
to $\eta_0$.

Region VI (magenta): $[A_1)[A_0]$ with $\eta_0 < t \le 1$. For a policy in this region,
the first finitely many points of the orbit of 1 belong to $A_0$, while every
other point lies in $A_1$. Note that $t > 1$ is included in Region I. The lack of
symmetry with Region IV in terms of strict and non-strict inequalities is due
to the choice that the threshold itself be included in the right region.




Figure 3.7: Invariant measure for a typical Region VI policy 
with a = 0.8, b = 0.3, p = 0.5, q = 0.3. Entropy is 0.3182. The 
orbit of 0 approaches $\eta_1$, while the orbit of 1 follows an approach
sequence to $\eta_0$ for a finite number of steps (in this case 1 step)
before also approaching $\eta_1$.

Figure 3.8: Location of the optimal threshold policy. The four-dimensional
parameter space is represented with a on the major
horizontal axis increasing rightwards, b on the major vertical axis
increasing downwards, p on the minor horizontal axis increasing
rightwards and q on the minor vertical axis increasing downwards.
Each parameter is sampled by $\{0.025, 0.075, \dots, 0.975\}$, omitting
the trivial cases when a + b = 1, resulting in 152000 sample points.
Colours representing regions are defined in Figures 3.2-3.7.
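
The sampling grid described in the caption of Figure 3.8 can be enumerated as in the sketch below, which is an illustration rather than the program used for the figure. Grid values are indexed by integers so that the excluded case a + b = 1 is detected exactly instead of through floating-point comparison; the function names are hypothetical.

```python
from itertools import product
from typing import Iterator, Tuple

GRID_SIZE = 20  # the values 0.025, 0.075, ..., 0.975


def grid_value(i: int) -> float:
    """The i-th grid value, (2i + 1) / 40, for i = 0, ..., 19."""
    return (2 * i + 1) / 40


def sample_points() -> Iterator[Tuple[float, float, float, float]]:
    """Yield (a, b, p, q) over the grid, skipping the trivial cases a + b = 1."""
    for i, j, k, l in product(range(GRID_SIZE), repeat=4):
        # a + b = (i + j + 1) / 20, which equals 1 exactly when i + j = 19.
        if i + j == GRID_SIZE - 1:
            continue
        yield grid_value(i), grid_value(j), grid_value(k), grid_value(l)
```

Of the 20^4 = 160000 grid points, 20 x 400 = 8000 have a + b = 1, which leaves the 152000 sample points quoted in the caption.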
The bound in Proposition 3.4 is consistent with the empirical running time
for the calculations used to generate Figure 3.8. Our program experienced
significant slowdowns when either a and p, or b and q were large, with the
slowest occurring at the extreme point a = b = p = q = 0.975. This suggests
that a uniform bound when a → 1 is impossible, and that Algorithm 3.6 will
fail for sufficiently large values of a. On the other hand, either a = 1 or b = 1
results in trivial limiting behaviour of the underlying chain, so this situation
is unlikely to occur in any practical application.



Note that in Figure 3.8, the half of the main diagonal where a + b > 1 is
entirely blue. This can be proven analytically. We note that the condition 
a + b > 1 corresponds to the underlying Markov chain having positive one- 
step autocorrelation, which is a reasonable assumption when the frequency of 
observation is greater than the frequency of change in the observed system, 
since in this case, one would not expect the system to oscillate with each 
observation. 

Proposition 3.8. In the symmetric case a = b and p = q, with a + b > 1,
the optimal general policy is the unique Region V policy $g_V$ given by $R_0 \subset A_1$
and $R_1 \subset A_0$.



Proof. Proposition 2.34 gives the limiting expected entropy as a convex
combination of two quantities, hence the limiting expected entropy is at least

$$\min\left\{ \frac{\sum_k c_k^{(0)} H(z_k^{(0)})}{\sum_k c_k^{(0)}},\ \frac{\sum_k c_k^{(1)} H(z_k^{(1)})}{\sum_k c_k^{(1)}} \right\}. \tag{3.1}$$

Note that equality is realised when the two quantities above are equal, which
occurs under $g_V$, since in that case $c_k^{(0)} = c_k^{(1)}$ and $z_k^{(0)} = 1 - z_k^{(1)}$.

Let $z_k = z_k^{(0)}$, $c_k = c_k^{(0)} / \sum_j c_j^{(0)}$ and $H_k = H(z_k^{(0)})$. In order to prove that $g_V$
is optimal, by symmetry, it suffices to prove that it minimises $\sum_k c_k H_k$.

First, we show that for each $k$ and any other policy $g$, $H_k(g_V) \le H_k(g)$,
where the notation $H_k(g)$ makes the dependence on the policy explicit.



By Lemma 2.29, $z_k(g) = r^k(0) \ge r_1^k(0) = z_k(g_V)$. By Lemma 2.30 and our
assumption that a + b > 1, iterations of $r_0$ and $r_1$ approach their limits
monotonically, hence $z_k(g) \le \eta_0$ and $z_k(g_V) \le \eta_1$. These inequalities are
illustrated in Figure 3.9.



Combining these inequalities gives $z_k(g_V) \le z_k(g) \le \eta_0$. Entropy is concave
on the interval $[0, 1]$, and therefore on the subinterval $[z_k(g_V), \eta_0]$, hence

$$H(z_k(g)) \ge \min\{H(z_k(g_V)),\ H(\eta_0)\}. \tag{3.2}$$

Since $z_k(g_V) \le \eta_1 < \tfrac{1}{2}$ and entropy is increasing on $[0, \tfrac{1}{2}]$, $H(z_k(g_V)) \le H(\eta_1)$,
which is equal to $H(\eta_0)$ by symmetry. Hence, the inequality (3.2) reduces to
$H(z_k(g)) \ge H(z_k(g_V))$, that is, $H_k(g) \ge H_k(g_V)$.



Figure 3.9: Diagram showing $z_k(g_V)$ and $z_k(g)$ in relation to 0,
$\eta_1$, $\eta_0$ and 1. All positions are fixed except that $z_k(g)$ may be to
the left of $\eta_1$. Since $z_k(g_V)$ lies to the left of $\eta_1$ and the diagram
is symmetric, $z_k(g_V)$ has lower entropy than $\eta_0$. Since $z_k(g)$ lies
between $z_k(g_V)$ and $\eta_0$ and entropy is concave, $z_k(g)$ has higher
entropy than $z_k(g_V)$. Hence $g_V$ minimises the entropy at $z_k$.

Next, we show that for each $k$ and any other policy $g$,

$$c_0(g_V) + c_1(g_V) + \cdots + c_k(g_V) \ge c_0(g) + c_1(g) + \cdots + c_k(g). \tag{3.3}$$

Let $\alpha_k = \alpha(z_k^{(0)})$. Since a + b > 1, by (2.36), $\alpha_1$ is decreasing, while $\alpha_0(z) =
\alpha_1(1 - z)$ is increasing. We have already established that $z_k(g_V) \le z_k(g) \le \eta_0$
and $z_k(g_V) \le \eta_1 = 1 - \eta_0 \le 1 - z_k(g)$, which implies $\alpha_1(z_k(g_V)) \ge \alpha_1(z_k(g))$
and $\alpha_1(z_k(g_V)) \ge \alpha_0(z_k(g))$ respectively. This shows that $\alpha_k(g_V) \ge \alpha_k(g)$.



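For $\ell < k$, write

$$c_0 + \cdots + c_k = \frac{1 + \alpha_0 + \alpha_0\alpha_1 + \cdots + \alpha_0\alpha_1\cdots\alpha_{k-1}}{1 + \alpha_0 + \alpha_0\alpha_1 + \cdots + \alpha_0\alpha_1\cdots\alpha_{k-1} + \cdots} = \frac{X + \alpha_\ell Y}{X + \alpha_\ell Z}.$$

Since $X > 0$ and $0 < Y < Z$, decreasing $\alpha_\ell$ increases $c_0 + \cdots + c_k$, and the
same is true for $\ell \ge k$, since in that case we can write the expression in the
same way with $Y = 0$. This proves (3.3).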



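Using (3.3) and $H_0(g_V) \le H_1(g_V) \le H_2(g_V) \le \cdots$, for any $\ell \in \mathbb{N}$,

$$\begin{aligned}
S_\ell &= \sum_{k<\ell} \bigl(c_k(g_V) - c_k(g)\bigr) H_\ell(g_V) + \sum_{k\ge\ell} \bigl(c_k(g_V) - c_k(g)\bigr) H_k(g_V) \\
&\le \sum_{k\le\ell} \bigl(c_k(g_V) - c_k(g)\bigr) H_{\ell+1}(g_V) + \sum_{k>\ell} \bigl(c_k(g_V) - c_k(g)\bigr) H_k(g_V) = S_{\ell+1}.
\end{aligned}$$

Since $\sum_k c_k = 1$ identically, the second series vanishes as $\ell \to \infty$, while the
first series is always non-negative by (3.3), hence $S_0 \le 0$. Thus

$$\sum_k c_k(g_V) H_k(g_V) \le \sum_k c_k(g) H_k(g_V) \le \sum_k c_k(g) H_k(g).$$

This proves the required minimisation. □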
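
The monotonicity claim used to establish (3.3) in the proof above can be checked directly. The following is a short verification, treating $\alpha_\ell$ as a free variable $\alpha \ge 0$ and assuming, as in the proof, that $X > 0$ and $Y < Z$.

```latex
\frac{d}{d\alpha}\,\frac{X + \alpha Y}{X + \alpha Z}
  = \frac{Y (X + \alpha Z) - Z (X + \alpha Y)}{(X + \alpha Z)^{2}}
  = \frac{X (Y - Z)}{(X + \alpha Z)^{2}} < 0 .
```

The ratio is therefore strictly decreasing in $\alpha$, so decreasing $\alpha_\ell$ increases $c_0 + \cdots + c_k$, and the same computation covers the case $Y = 0$.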



Note that the proof above relies heavily on the fact that equality is attained
in (3.1). This occurs only in the symmetric case, and thus this approach does
not generalise readily. The complexity of the proof in the symmetric case is
indicative of the difficulty of the problem in general, and thus highlights the
importance of the empirical description provided by Figure 3.8.



In the course of performing the computations to generate Figure 3.8, we
noticed that entropy is unimodal with respect to threshold, with threshold
considered as a circle as in Figure 3.1. While we cannot prove this analytically,
it is true for each of the 152000 points in the parameter space considered.
This allows some simplification in finding the optimal threshold policy, since
finding a local minimum is sufficient. Thus, we can alter Algorithm 3.6 to
begin by testing only two policies, then testing policies in the direction of
entropy decrease until a local minimum is found. However, the running time
improvement is only a constant factor; if we model entropy as a standard
sinusoid with respect to threshold, then the running time decreases by a
factor of 3 on average.
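
The modified search can be sketched as a descent on a circular arrangement of candidate policies. The sketch assumes, hypothetically, that the equivalence classes have already been indexed 0, ..., n-1 around the circle of Figure 3.1 and that `entropy_at` returns the entropy of class i; in the actual algorithm the next class is only known after evaluating the current one, and unimodality is an empirical observation rather than a proven fact.

```python
from typing import Callable


def circular_local_minimum(entropy_at: Callable[[int], float], n: int) -> int:
    """Index of a local minimum of entropy on a circle of n candidate policies."""
    e0, e1 = entropy_at(0), entropy_at(1 % n)
    # Test two neighbouring policies to choose the direction of decrease.
    if e1 < e0:
        step, current, e_current = 1, 1 % n, e1
    else:
        step, current, e_current = -1, 0, e0
    while True:
        candidate = (current + step) % n
        e_candidate = entropy_at(candidate)
        if e_candidate >= e_current:
            return current  # no further decrease in this direction
        current, e_current = candidate, e_candidate
```

Each step requires a strict decrease, so the walk cannot cycle and stops after at most n evaluations beyond the first two.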
3.3 General Policies



The problem of determining the optimal general policy is much more difficult, 
due to the complexity of the space of general policies. Since a policy is 
uniquely determined by the value of the policy function at the orbit points, 
this space can be viewed as a hypercube of countably infinite dimension, 
which is much more difficult to study than the space of threshold policies, 
which is a circle. 

One strategy is to truncate the orbit and consider a finite dimensional hy- 
percube, justified by the fact that orbit points have masses which decay 
geometrically, and thus the tail contributes very little. However, a
truncation at $N$ (that is, forcing the policy to be constant on $\{r^N(0), r^{N+1}(0), \dots\}$,
and similarly for the orbit of 1) gives $2^{2N+2}$ possible policies, which is still
far too large to determine optimality by checking the entropy of each policy.

The next approximation is to only look for locally optimal policies, in the 
sense that changing the policy at each of the $2N + 2$ truncated orbit points
increases entropy, and hope that by finding enough such locally optimal poli- 
cies, the globally optimal policy will be among them. Since a hypercube has 
very high connectivity, regions of attraction tend to be large, which heuris- 
tically suggests that this strategy will be effective. 

Algorithm 3.9. Finding a locally optimal truncated policy.

1. Pick $N$, and a starting policy, expressed as a pair of sequences of binary
digits $g^{(i)} = (g_k^{(i)})$, with $k = 0, 1, \dots, N$;

2. Cycle through the digits $g_k^{(i)}$, flipping the digit if it gives a policy with
lower entropy, otherwise leaving it unchanged;

3. If the previous step required any changes, repeat it, otherwise a locally
optimal truncated policy has been found.
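
A minimal sketch of this local search is given below. It assumes a hypothetical function `entropy` that maps a truncated policy, stored here as a flat list of 2N + 2 binary digits rather than as two 64-bit integers, to its limiting expected entropy.

```python
from typing import Callable, List, Sequence, Tuple


def local_search(start: Sequence[int],
                 entropy: Callable[[List[int]], float]) -> Tuple[List[int], float]:
    """Flip single digits while doing so lowers entropy; stop at a local optimum."""
    bits = list(start)
    best = entropy(bits)
    changed = True
    while changed:              # step 3: repeat while the last pass changed a digit
        changed = False
        for i in range(len(bits)):   # step 2: cycle through the digits
            bits[i] ^= 1             # tentatively flip digit i
            value = entropy(bits)
            if value < best:
                best = value         # keep the flip
                changed = True
            else:
                bits[i] ^= 1         # undo the flip
    return bits, best
```

Because every accepted flip strictly lowers the entropy and there are finitely many truncated policies, the loop terminates at a policy that no single flip improves.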

We picked $N = 63$ since this allows a policy to be easily expressed as two
unsigned 64-bit integers, and for each of the 152000 uniformly spaced
parameters of Figure 3.8, we generated 10 policies uniformly on the hypercube and
applied Algorithm 3.9.



None of the locally optimal policies for any of the parameter values had
lower entropy than the optimal threshold policy from Figure 3.8, and on
average 98.3% of them were equivalent to the optimal threshold policy, up 
to a precision of 0.1%, indicating that the optimal threshold policy is locally 
optimal with a very large basin of attraction, which strongly suggests that it 
is also the globally optimal policy. 

Conjecture 3.10. In the special case, the infimum of entropy attainable 
under threshold policies is the same as that under general policies. 

The fact that a large proportion of locally optimal policies have globally
optimal entropy gives a new method for finding the optimal policy. By
picking 10 random truncated policies and running Algorithm 3.9, at least one
of them will yield an optimal policy with very high probability. Empirical
observations suggest that this method is slower than Algorithm 3.6 on
average, but since the success rate remains high while Algorithm 3.6 becomes
significantly slower as a approaches 1, this method is a better alternative for
some parameter values.

Figure 3.10: Locally optimal policies. Axes are as in Figure 3.8.
Darkness increases with the proportion of simulated locally optimal
policies which have the same entropy as the optimal threshold
policy, up to a precision of 0.1%. The average is 9.83 out of 10, but
the distribution is far from uniform: local optima are exceedingly
likely to be the same as the threshold optimum for some parameter
values and exceedingly unlikely for others. The boundaries
are approximately those of the threshold regions (see Figure 3.8),
with some imprecision due to the non-deterministic nature of the
simulation data.
One last policy of interest is the greedy policy. In the previous sections,
we considered a long term optimality criterion in the minimisation of limiting
expected entropy, but in some cases, it may be more appropriate to set a 
short term goal. In particular, one may instead desire to minimise expected 
entropy at the next step, in an attempt to realise maximal immediate gain 
while ignoring future implications. 

Definition 3.11. The greedy policy is the policy such that the expected
entropy after one observation is minimised. Up to an exchange of strict and
non-strict inequalities, this is given by

$$A_0 = \{z \in [0, 1] : \alpha_0(z) H(r_0(z)) < \alpha_1(z) H(r_1(z))\},$$
$$A_1 = \{z \in [0, 1] : \alpha_0(z) H(r_0(z)) \ge \alpha_1(z) H(r_1(z))\}.$$

The greedy policy has the benefit of being extremely easy to use, as it only 
requires a comparison of two functions at the current information state. Since 
these functions are smooth, efficient numerical methods such as Newton- 
Raphson can be used to determine the intersection points, thus allowing the 
policy to be described by a small number of thresholds. 
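
The comparison that defines the greedy policy can be sketched as follows. The functions alpha0, r0, alpha1 and r1 are passed in as arguments because their formulas are given elsewhere in the thesis; the natural logarithm is an assumption, which does not affect the comparison since changing the base rescales both sides equally.

```python
import math
from typing import Callable

Fn = Callable[[float], float]


def binary_entropy(z: float) -> float:
    """Entropy of the distribution (z, 1 - z), taken as 0 at the endpoints."""
    if z <= 0.0 or z >= 1.0:
        return 0.0
    return -z * math.log(z) - (1.0 - z) * math.log(1.0 - z)


def greedy_observation(z: float, alpha0: Fn, r0: Fn, alpha1: Fn, r1: Fn) -> int:
    """Return 0 if z lies in A_0 of the greedy policy, otherwise 1."""
    cost0 = alpha0(z) * binary_entropy(r0(z))  # expected entropy using process 0
    cost1 = alpha1(z) * binary_entropy(r1(z))  # expected entropy using process 1
    return 0 if cost0 < cost1 else 1
```

Ties are assigned to process 1 here, matching the non-strict inequality in the definition of A_1; the opposite convention would serve equally well.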

In fact, only one threshold is required, as computational results show that the
greedy policy is always a threshold policy. Using the 152000 uniformly
distributed data points from before, in each case the two functions defining the
greedy policy crossed over at most once.

Conjecture 3.12. The greedy policy is always a threshold policy. 

Idea of Proof. Note that $q\,\alpha_0(z)\,r_0(z) = r_1(z)\,\alpha_1(z)$. It may appear at first
glance that the factor of $q$ violates symmetry, but recall that $z$ maps to $1 - z$
under relabelling.

Using this identity, the intersection points that define the greedy policy
satisfy $G(r_0(z)) = q\,G(r_1(z))$, where $G(z) = H(z)/z$. It is easy to see that $G$ is
monotonic decreasing on $[0, 1]$ with range $[0, \infty]$, hence $F(z) = G^{-1}(qG(z))$
is a well-defined one-parameter family of functions mapping $[0, 1]$ to itself
with fixed points at 0 and 1.

On the other hand, $f(y) = r_0(r_1^{-1}(y)) = y / ((1 - pq)y + pq)$ is also a
one-parameter family of functions mapping $[0, 1]$ to itself with fixed points at
0 and 1. Since the range of $r_0$ is contained in $(0, 1)$, we can discount the
endpoints $y = 0$ and $y = 1$, hence it suffices to show that the equation
$F(y) = f(y)$ has at most one solution for $y \in (0, 1)$.

Convexity methods may help in this last step but we have not been able to 
complete the proof. □ 
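
The step from the identity to the intersection condition in the proof idea above can be written out as follows. This is a sketch which assumes that $\alpha_0(z)$, $r_0(z)$ and $r_1(z)$ are all non-zero at the intersection point, so that the divisions are valid.

```latex
\alpha_0(z)\,H(r_0(z)) = \alpha_1(z)\,H(r_1(z)),
\qquad
\alpha_1(z) = \frac{q\,\alpha_0(z)\,r_0(z)}{r_1(z)}
\;\Longrightarrow\;
\frac{H(r_0(z))}{r_0(z)} = q\,\frac{H(r_1(z))}{r_1(z)},
\quad\text{that is,}\quad
G(r_0(z)) = q\,G(r_1(z)).
```

Writing $y = r_1(z)$ then gives $r_0(z) = G^{-1}(qG(y)) = F(y)$ on one side and $r_0(z) = r_0(r_1^{-1}(y)) = f(y)$ on the other, which is the equation $F(y) = f(y)$.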



Even when the greedy policy is not optimal, it is very close to optimal. Of the
152000 uniformly distributed data points in Figure 3.11 below, the greedy
policy is non-optimal at only 6698 points, or 4.41%, up to an error tolerance 
of 10^^^. On average, the greedy policy has entropy 0.0155% higher than 
the optimal threshold policy, with a maximum error of 5.15% occurring at
the sample point a = 0.975, b = 0.925, p = 0.675 and q = 0.375. Thus the
greedy policy provides an alternative suboptimal policy which is very easy
to calculate and very close to optimal.

Figure 3.11: Optimality of the greedy policy. Axes are as in
Figure 3.8. Light grey indicates that the greedy policy is the
optimal threshold policy; darker points indicate suboptimality with
darkness proportional to error. Similarly to Figure 3.10, the
suboptimal points lie on the boundaries of the threshold regions.

We make a final remark that the likelihood of a locally optimal policy being
globally optimal as shown in Figure 3.10, and the closeness of the greedy
policy to the optimal threshold policy as shown in Figure 3.11, both exhibit
a change in behaviour at the boundaries of the threshold regions as shown in
Figure 3.8. This suggests that these regions are indeed qualitatively different,
and are likely to be interesting objects for further study.
4 Conclusion and Future Work



This thesis presents initial groundwork in the theory of hidden Markov mod- 
els with multiple observation processes. We prove a condition under which 
the information state converges in distribution, and give algorithms for find- 
ing the optimal policy in a special case, which provides strong evidence that 
the optimal policy is a threshold policy. 

Future work will aim to prove the conjectures that we were not able to prove: 

• The information state converges for an anchored hidden Markov model 
with only ergodicity rather than positivity; 

• The greedy policy is always a threshold policy;

• Among threshold policies, the limiting expected entropy is unimodal 
with respect to threshold; and 

• The optimal threshold policy is also optimal among general policies. 

Possible approaches to these problems are likely to be found in [9], [10] and 
The author was not aware of these papers until after the date of
submission, and thus was unable to incorporate their methods into this thesis. 

Better algorithms and error bounds for finding the optimal policy are also a 
worthwhile goal. Although our algorithms are computationally feasible with 
reasonable prescribed errors, our focus was on finding workable rather than 
optimal algorithms, and thus there is plenty of room for improvement. 

Another direction for future work would be to extend our results into more 
general cases. 